Jonathan T. Kaplan, being duly warned that willful false statements and the like so
made are punishable by fine or imprisonment, or both, under Section 1001 of Title 18 of the 
United States Code, and that such willful false statements may jeopardize the validity of the 
application or any patent resulting therefrom, declares that: 



1. I am registered to practice before the Office in patent cases and am attorney of record 
herein. I am familiar with all aspects of the preparation and filing of the application 
herein. 



2. This is an application for broadening reissue of a previously-issued United States 
patent. As such it neither discloses nor claims anything not disclosed in the original 
application for patent. 

3. On 31 July 2000, as part of the process of completing formalities associated with the
filing of the application herein, I caused to be delivered to each of the four inventors 
named herein a copy of the specification and a statutory declaration stating that the 
inventors were first, original, and joint inventors of the subject matter described and 
claimed in the application. In each case the specification and declaration were
accompanied by a cover letter requesting that the addressee review the application and 
sign and return the declaration. These letters and their enclosures were sent by 
Federal Express "priority overnight" service, signature required. 

4. By 18 August I received from three of the four inventors, namely Tai A. Ly, Donald
B. MacMillen, and Ronald A. Miller, either signed declarations or clear oral
indications that the declaration would be signed and returned. By 31 August 2000 I
received signed declarations from each of these three inventors. 

5. My letter to the fourth inventor, Mr. David W. Knapp, was not answered. This letter 
and its enclosures were delivered by Federal Express to Mr. Knapp's place of
business on 1 August 2000. Acknowledgement of the delivery was made on
behalf of Mr. Knapp by a B. Ngyen. A copy of my letter of 31 July 2000 to Mr.
Knapp, including copies of the enclosures to that letter and an electronic copy of B.
Ngyen's signature acknowledging the letter's receipt, is attached hereto as Exhibit 1.

6. Mr. Knapp's failure to respond to my letter and request of 31 July 2000 was not
surprising to me. I had previously received information indicating that Mr. Knapp is 
now employed by a competitor of the assignee of the patent which is the subject of 
this reissue application, and that in fact Mr. Knapp's current employer may now be 
engaged in conduct which would constitute infringement of the reissued patent. 

7. Having received no response to my letter of 31 July 2000, I caused to be sent to Mr.
Knapp a second letter on 18 August 2000. This second letter referred to and reiterated
the contents of the first and, like the first, enclosed a copy of the specification herein 
and a statutory declaration for Mr. Knapp's signature. This letter and its enclosures 
were again delivered by Federal Express to Mr. Knapp's business address. Delivery 
was completed on 21 August. Acknowledgement of the delivery was made on Mr. 
Knapp's behalf by an E. Castor. A copy of my letter of 18 August 2000 to Mr. Knapp,
including copies of the enclosures to that letter and an electronic copy of E. Castor's 
signature acknowledging the letter's receipt, is attached hereto as Exhibit 2. 

8. On or about 1 September 2000 I received a letter from Mr. Knapp in response to my 
letter of 31 July 2000. This letter bears Mr. Knapp's signature and states that Mr.
Knapp will neither sign the declaration nor otherwise cooperate in prosecuting the 
application. A copy of Mr. Knapp's letter is attached hereto as Exhibit 3. 

9. The facts set forth in this Declaration are true, and all statements made of my own 
knowledge are true, and all statements made on information and belief are believed to
be true.

HOWREY SIMON ARNOLD & WHITE, LLP
Box No. 34
1299 Pennsylvania Avenue, N.W.
Washington, D.C. 20004-2402
(202) 783-0800



Dated: November 9, 2001




Jonathan T. Kaplan 
Registration No. 38,935 
Attorney for Assignee 






PATENT

In re Patent No. 5,764,951 to LY et al. 
Application Serial No. 09/590,584 
Attorney Docket 06816.0010



EXHIBIT 1 
TO 

DECLARATION OF JONATHAN T. KAPLAN 
AND STATEMENT OF FACTS IN SUPPORT OF FILING 
ON BEHALF OF NON-SIGNING INVENTOR 

Pursuant to 37 CFR 1.47 



Declaration of Jonathan T. Kaplan 










8/4/2000 



FedEx Express 
Customer Support Trace 
3875 Airways Boulevard 
Module H, 4th Floor
Memphis, TN 38116 



U.S. Mail: PO Box 727 
Memphis, TN 38194-4643 

Telephone: 901-369-3600 



Dear Customer: 
Delivery Information:
Signed For By: B.NGYEN 




Delivery Location: 2107 N 1ST ST 350 
Delivery Date: August 1, 2000 
Delivery Time: 1019 

Shipping Information: 

Tracking No: 820573308506 Ship Date: July 31, 2000



Recipient: 

MR DAVID W KNAPP 
GET 2 CHIP COM 
2107 N FIRST ST 350 
SAN JOSE, CA 95131
US 



Shipper: 

KAPLAN JONATHAN T 
BROWN RAYSMAN ET AL 
120 W 45TH ST FL 21
NEW YORK, NY 100364041 



Shipment Reference Information: 



4000/10 



Thank you for choosing FedEx Express. We look forward to working with you in the future. 

FedEx Worldwide Customer Service 
1-800-Go-FedEx® 

Reference No.: R200008040001 92031 23 



http://www.fedex.com/cgi-bin/spod
Partner
31 July 2000

VIA FEDERAL EXPRESS 

(SIGNATURE REQUIRED ON RECEIPT)

[Stamp: RECEIVED FEB 13 2002, OFFICE OF PETITIONS]

Mr. David W. Knapp
Chief Technical Officer
Get2Chip.com, Inc.
2107 North First Street, Suite 350
San Jose, California 95131



Re: Reissue of U.S. Patent 5,764,951 to Ly et al.

METHODS FOR AUTOMATICALLY PIPELINING LOOPS 
Our File Ref 4000/10 

Dear Mr. Knapp: 

This law firm represents your former employer Synopsys, Inc. (hereinafter "Synopsys"). 
Synopsys is currently seeking reissuance of the above-identified patent in the United States 
Patent and Trademark Office. Enclosed please find a Declaration of Inventor which we will be 
filing in the Reissue Patent Application. Also enclosed is a copy of the Reissue Application, 
which adds new claims 23-34 to those which already issued in U.S. Patent 5,764,951. Please
review the Declaration and the new claims carefully. 

Assuming you understand and agree with the Declaration, Synopsys requests that you do 
the following. Please complete the Declaration by entering, in the boxes at the end of the
Declaration, your country of citizenship and the address of your current home residence. Once
you have completed the Declaration, please sign the Declaration in the indicated box at the end 
of the Declaration. Return the Declaration to us by means of the enclosed self-addressed 
stamped envelope. 



IT IS VERY IMPORTANT THAT YOU RETURN THE DECLARATION TO US 
BY SEPTEMBER 1, 2000. IF YOU HAVE ANY QUESTIONS ABOUT THE 
DECLARATION, IT IS EXTREMELY IMPORTANT THAT YOU CONTACT US AS 
SOON AS POSSIBLE SO THAT WE MAY ANSWER YOUR QUESTIONS. 

As I am sure you will recall, you have previously assigned all rights in the invention to 
Synopsys as your former employer.

Mr. David W. Knapp
Get2Chip.com, Inc. 
31 July 2000 
Page 2 



Your cooperation in this matter is greatly appreciated. On behalf of Synopsys, Inc., I 
thank you most sincerely for your help. 

Very truly yours, 



Jonathan T. Kaplan 

JTK/mjm 

Enclosures: Declaration of Inventor (w/ self-addressed stamped envelope)
Reissue Application
[Stamp: RECEIVED FEB 13 2002, OFFICE OF PETITIONS]

Reissue Appl. No.: 09/590,584
Filed: June 8, 2000
In re Patent to: Ly et al.
Patent No.: 5,764,951
Issued: June 9, 1998
Title: METHODS FOR AUTOMATICALLY PIPELINING LOOPS

Box REISSUE

Assistant Commissioner for Patents 
Washington, D.C. 20231 

DECLARATION OF INVENTOR DAVID W. KNAPP IN APPLICATION 
FOR BROADENING REISSUE OF PATENT 
Pursuant to 37 CFR 1.63 and 1.175 

This declaration is made in application for broadening reissue of the above-identified
patent.

As a below named inventor, I hereby declare that: 

My residence, post office address and citizenship are as stated below next to my name. 

I believe I am an original, first and joint inventor of the subject matter which is claimed and for 
which a patent is sought on the invention entitled 

METHODS FOR AUTOMATICALLY PIPELINING LOOPS



the specification of which: (check one) 
[ ] is attached hereto; or
[X] was filed on June 8, 2000 as U.S. Reissue Application Serial No. 09/590,584.

I hereby state that I have reviewed and understand the contents of the above identified 
specification, including the claims, as amended by any amendment referred to above. 

I acknowledge the duty to disclose information which is material to patentability as defined in Title 
37, Code of Federal Regulations, §1.56. 
Application Serial No. 09/590,584
In re Patent No. 5,764,951
Patentees: Ly et al.
Attorney Docket No. 4000/10 

I believe the original patent to be wholly or partly inoperative or invalid, for the reasons described 
below: 

[ ] by reason of a defective specification or drawing.
[X] by reason of the patentee claiming more or less than he had the right to claim in the patent.
[ ] by reason of other errors.

At least one error upon which this application for reissue is based is described as follows: 

The limitation of claim 1 to methods comprising steps of parsing text descriptions including loops 
with delayed signal assignments having delay values and setting latencies of pipelines equal to said 
delay values is more limiting than necessary, and resulted in the patentee claiming less than he had 
a right to claim. 

The limitation of claim 18 to systems comprising logic for parsing text descriptions including loops 
with delayed signal assignments having delay values and setting latencies of pipelines equal to said 
delay values is more limiting than necessary, and resulted in the patentee claiming less than he had 
a right to claim. 

The limitation of claim 21 to computer program products comprising computer readable program
code devices configured to cause a computer to effect parsing of text descriptions including loops
with delayed signal assignments having delay values and setting of latencies of pipelines equal to 
said delay values is more limiting than necessary, and resulted in the patentee claiming less than he 
had a right to claim. 

All errors being corrected in this reissue application arose without any deceptive intention on the 
part of the applicant. 

I hereby declare that all statements made herein of my own knowledge are true and that all
statements made on information and belief are believed to be true; and further that these 
statements were made with the knowledge that willful false statements and the like so made are 
punishable by fine or imprisonment, or both, under Section 1001 of Title 18 of the United States 
Code and that such willful false statements may jeopardize the validity of the application or any 
patent issued thereon. 



Full Name of First Joint Inventor David W. Knapp 


Inventor's Signature 




Date 


Residence 




Citizenship 


Post Office Address 
METHODS FOR AUTOMATICALLY PIPELINING LOOPS

Related Applications

This application is related to U.S. patent application Ser. No. 08/440,101 entitled
"Behavioral Synthesis Links to Logic Synthesis" with inventors Ronald A. Miller,
Donald B. MacMillen, Tai A. Ly and David Knapp filed on May 12, 1995, which is
hereby incorporated by reference.

Background 
Field of the Invention 

This invention relates to the field of computer aided design for digital circuits,
particularly to automatically pipelining loops in a behavioral synthesis system.

Statement of the Related Art 

Behavioral Synthesis 

Behavioral vs. Register Transfer Level Design 

Many of today's integrated circuits are described using a Hardware Description
Language (HDL). Two common HDLs are VHDL and Verilog. VHDL is described in
the IEEE Standard VHDL Language Reference Manual available from the Institute of
Electrical and Electronics Engineers in Piscataway, New Jersey which is hereby
incorporated by reference. Verilog is described in The Verilog Hardware Description
Language by Donald E. Thomas and Philip Moorby, Kluwer Academic Publishers,
1991 which is hereby incorporated by reference.

As integrated circuits become increasingly complex, hardware designers are
increasingly using synthesis software to transform HDL descriptions of digital circuits
into mapped logic. The designer writes a description of a digital circuit in VHDL,
Verilog, or another HDL, and uses synthesis software to create a digital circuit from the
description. Using synthesis software typically shortens the amount of time required to
create a digital circuit from a design specification, and allows a designer to create more
complex designs than is possible manually.

Many of today's complex designs are expressed as software descriptions and
simulated to verify their correctness. These designs are later translated from software
into hardware, in the form of Integrated Circuits (ICs), Application Specific Integrated
Circuits (ASICs), or Field Programmable Gate Arrays (FPGAs), for implementation in
the final product. This design description methodology is called algorithmic-level
design.

Instead of beginning design at the Register Transfer Level (RTL), behavioral
synthesis begins at the algorithmic (behavioral) level. RTL design is described in
Computer Structures: Readings and Examples by C. Gordon Bell and Allen Newell,
McGraw-Hill 1971. A behavioral hardware description language (HDL) specification
contains instructions, operations, variables, and arrays similar to the original software
algorithm.

The target architecture of behavioral synthesis is a general computing model that
contains datapath, memory, and control elements. Conventional design techniques
currently use a manual RTL design methodology to build a datapath. A datapath is a
sequence of logic consisting of registers, higher order functional units (such as adders
and multipliers), and multiplexers. The datapath in a digital circuit uses the circuit's
inputs to compute output results. Registers are 1-bit memory elements which hold their
value through each clock cycle.

Conventional design techniques also build a controller at the RTL to sequence
and control the actions of the datapath, memory, and Input/Output (I/O). Frequently,
such controllers are implemented using a Finite State Machine (FSM). Finite state
machines are described in Switching and Finite Automata Theory by Zvi Kohavi,
Computer Science Press, 1978 which is hereby incorporated by reference. Controllers
may also determine actions such as which branch of a conditional statement is executed.

Behavioral synthesis builds this architecture by using automated methods of
scheduling, allocation, register sharing, memory and control inferencing, all of which
are performed manually in an RTL methodology. The designer is freed from having to
specify the exact architecture of a design and can automatically explore many
implementations to find the optimal architecture.
Components of Behavioral Synthesis

The High-Level Synthesis of Digital Systems by Michael McFarland, Alice
Parker, and Raul Camposano, in Proceedings of the IEEE, February 1990, which is
hereby incorporated by reference, provides an excellent overview of High Level
Synthesis, as Behavioral Synthesis is often called.

Three components of a behavioral synthesis system are Scheduling, Allocation, 
and Resource Sharing. 

Scheduling determines in which clock cycle each operation executes.
Scheduling extracts the control and data flow operations of a design specification and
assigns these operations to cycles. A state machine controller is synthesized to sequence
the operations and execute them in their assigned cycle. The typical goal of this process
is to assign operations to cycles so as to be able to implement the design with the fewest
resources (registers, multiplexers, and operations) while at the same time minimizing
the number of clock cycles (latency).

Allocation is a behavioral synthesis task that maps the operations and data of a
behavioral HDL specification into the datapath, which contains memories, registers,
functional units such as adders and multiplexers, and gates. Allocation determines
which type of operation to use for each operator. For instance, if an operator performs
addition, a ripple carry, a carry-lookahead, or some other type of adder can be used.

Resource Sharing attempts to share hardware resources between operators in a
design. For example, consider two additions which occur in mutually exclusive
conditional branches. Such additions will never be performed at the same time. Thus,
they can be performed on the same piece of hardware. Resource sharing attempts to
minimize the amount of hardware used by sharing hardware as much as possible.
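
As an illustrative sketch only, and not a description of any particular synthesis tool, the mutual-exclusion test behind resource sharing can be written in a few lines of Python. The operation records, the field names `cond` and `branch`, and the function `can_share` are hypothetical and are introduced here purely for illustration.

```python
from itertools import combinations


def can_share(op_a, op_b):
    """Return True if two operations may use the same hardware unit.

    Assumption for this sketch: two operations are mutually exclusive when
    they have the same type and sit in different branches of the same
    conditional. Unconditional operations (cond is None) are never shared.
    """
    if op_a["cond"] is None or op_b["cond"] is None:
        return False
    return (
        op_a["type"] == op_b["type"]
        and op_a["cond"] == op_b["cond"]
        and op_a["branch"] != op_b["branch"]
    )


# Hypothetical operations: two additions in opposite branches of one
# conditional, plus one unconditional addition.
ops = [
    {"name": "add1", "type": "add", "cond": "c0", "branch": "then"},
    {"name": "add2", "type": "add", "cond": "c0", "branch": "else"},
    {"name": "add3", "type": "add", "cond": None, "branch": None},
]

# Pairs of operations that could be mapped onto a single adder.
shareable = [
    (a["name"], b["name"]) for a, b in combinations(ops, 2) if can_share(a, b)
]
```

In this sketch only `add1` and `add2` qualify, because they can never execute in the same pass through the conditional.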

Scheduling Modes

There are several modes for automatically scheduling operations into control
steps. Briefly, these modes are cycle-fixed, superstate-fixed, and free-floating mode. In
cycle-fixed mode, all I/O operations are constrained to occur in the same cycle in the
original HDL descriptions and in the synthesized design. In cycle-fixed mode, the cycle
level behavior of the synthesized circuit must match the cycle level simulation behavior
of the source HDL.

The other scheduling modes allow behavioral synthesis a greater degree of
freedom in assigning states in a schedule. Scheduling modes are discussed further in
Behavioral Synthesis Methodology for HDL-Based Specification and Validation by D.
Knapp, T. Ly, D. MacMillen and R. Miller in Proceedings of the 31st DAC, June 1995,
which is included as Appendix B and is hereby incorporated by reference. They are also
discussed in Behavioral Compiler User Guide Version 3.2a available from Synopsys,
Inc. in Mountain View, Calif., which is hereby incorporated by reference.

Loop Pipelining

In behavioral HDL, a loop repeatedly executes the operations in the loop body
until an exit condition becomes true. Loop iterations are usually sequential; operations
in the first iteration are executed, operations in the next iteration are executed, and so
on, as shown in Figure 1. The throughput, that is the amount of data processed per unit
time, of the function implemented by the loop body is limited by the critical path in the
loop body.

In some loops, data required by an operation in the next loop iteration is
available prior to completion of the current loop. Under these conditions, the designer
can pipeline the loop, parallelizing execution of iterations to increase throughput
beyond critical path limitations of the loop body. This process of loop pipelining
schedules consecutive loop iterations to partially overlap in time; a new loop iteration is
initiated before the current iteration has finished.

Figure 2 shows an example of loop pipelining where the data required by
operation A in iteration two is available after operation C in the first loop iteration.

The two timing-related aspects of a loop that affect throughput are:

Initiation interval: The number of clock cycles between the start of two 
consecutive loop iterations. 

Latency: The number of clock cycles required to execute all operations in a 
single loop iteration. 
For sequential loops that are not pipelined, the initiation interval and latency of a
loop are the same. For a pipelined loop, the initiation interval is smaller than the
latency.
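
The relationship between initiation interval and latency can be made concrete with a small calculation. The following Python sketch assumes that every iteration has the same latency and that consecutive iterations start exactly one initiation interval apart; the function name `total_cycles` is hypothetical and is not taken from the specification.

```python
def total_cycles(iterations, latency, initiation_interval=None):
    """Clock cycles needed to run a loop for a given number of iterations.

    If no initiation interval is given, the loop is treated as sequential,
    so the initiation interval equals the latency.
    """
    if initiation_interval is None:
        initiation_interval = latency
    if iterations <= 0:
        return 0
    # The last iteration starts after (iterations - 1) initiation intervals
    # and then needs one full latency to finish.
    return (iterations - 1) * initiation_interval + latency


# Ten iterations of a loop body with a latency of 4 cycles.
sequential = total_cycles(10, latency=4)
pipelined = total_cycles(10, latency=4, initiation_interval=2)
```

Under these assumptions, ten iterations take 40 cycles when run sequentially and 22 cycles when pipelined with an initiation interval of 2.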

The primary reason for using loop pipelining is to increase the throughput of the
design; the trade-off is that the design area usually increases.

Many designs have separate specifications on throughput and input-to-output
delay. The throughput specification constrains the initiation interval. The
input-to-output delay specification constrains the loop latency. Loop pipelining enables a
flexible relationship between the initiation interval and latency of a loop.

An example of a candidate for loop pipelining is a design that processes a data
stream. This type of design often has tight throughput requirements based on the rate of
the data streams and loose input-to-output delay constraints.

Loop Carry Dependencies

Loop Carry Dependencies (LCDs) are data values produced in one iteration of a
loop and consumed by operations in subsequent iterations.

In loop pipelining, loop iterations that are producers and consumers of LCDs
can happen at the same time. To preserve data dependencies, the operations in a loop
must be scheduled so that LCD values are available in time for the iteration in which
they will be consumed. Two schedules for an LCD are shown in Figure 3.

The example of Figure 3(a) violates the LCD. Operation 410 is scheduled so that
its output is not ready in time for operation 420 to use it in the next iteration of the loop.
The example of Figure 3(b) is scheduled correctly. In this case, operation 410 is
scheduled so that its output is ready in time for operation 420 to execute in the next
iteration of the loop.
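
The timing condition for a loop carry dependency can be expressed as a single inequality. The Python sketch below is only an illustration under an assumed cycle-counting convention: `producer_ready` is the first cycle, counted from the start of the producing iteration, in which the value can be read. The function name and all cycle numbers are hypothetical and do not correspond to the operations in Figure 3.

```python
def lcd_satisfied(producer_ready, consumer_start, initiation_interval, distance=1):
    """Return True if a loop-carried value arrives in time for its consumer.

    producer_ready: first cycle, relative to the start of the producing
        iteration, in which the value is available (assumed convention).
    consumer_start: cycle, relative to the start of the consuming iteration,
        in which the consumer reads the value.
    distance: number of iterations between producer and consumer.
    """
    # The consuming iteration starts 'distance' initiation intervals later,
    # so shift the consumer onto the producer's time axis before comparing.
    consumer_absolute = distance * initiation_interval + consumer_start
    return producer_ready <= consumer_absolute


# With an initiation interval of 2, a value that is ready in cycle 3 is too
# late for a consumer that runs in cycle 0 of the next iteration.
violating = lcd_satisfied(producer_ready=3, consumer_start=0, initiation_interval=2)
correct = lcd_satisfied(producer_ready=2, consumer_start=0, initiation_interval=2)
```

A scheduler could apply such a check to each producer and consumer pair before accepting a candidate schedule.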
Memory and I/O Accesses

Loop pipelining must preserve the original ordering of all reads and writes to the
same memory, signal, or port. In addition, reads and writes in one iteration
of the loop may not "cross," or occur after, reads and writes in subsequent iterations of
the loop. Specifically, all reads and writes to the same memory, and all writes to the
same signal or port in one iteration of the loop must occur before any reads or writes to
the same memory, signal or port in a subsequent iteration of the loop. All reads of the
same signal or port must occur simultaneously to or before any read of the same signal
or port in a subsequent iteration of the loop.

For example, Figure 4 shows two schedules for a loop that has two reads of
signal X. In Figure 4(a), read 510 and read 520 are improperly scheduled. Read 520
occurs after read 510 occurs in the next iteration of the loop. In Figure 4(b), read 510
and read 520 are properly scheduled. In this schedule, read 520 occurs before read 510 in
the next iteration of the loop.
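
The ordering rules for reads and writes can be checked mechanically once the cycle of each access is known. The Python sketch below is a minimal illustration; it assumes that access cycles are counted from the start of an iteration and that iterations start one initiation interval apart. The function names and cycle numbers are hypothetical and are not taken from Figure 4.

```python
def reads_ordered(read_cycles, initiation_interval):
    """Check reads of one signal or port across consecutive iterations.

    Every read in one iteration must occur at the same time as, or before,
    every read of that signal in the next iteration. Checking adjacent
    iterations is enough, because later iterations start even later.
    """
    return max(read_cycles) <= min(read_cycles) + initiation_interval


def writes_ordered(access_cycles, initiation_interval):
    """Check writes to a signal or port, or any accesses to one memory.

    These accesses must occur strictly before any access to the same
    resource in the next iteration.
    """
    return max(access_cycles) < min(access_cycles) + initiation_interval


# Two reads of the same signal, with an initiation interval of 2.
improper = reads_ordered([0, 3], initiation_interval=2)      # second read is too late
proper = reads_ordered([0, 2], initiation_interval=2)        # simultaneous is allowed
writes_overlap = writes_ordered([0, 2], initiation_interval=2)  # writes must be strictly earlier
```

The only difference between the two checks is the strictness of the comparison, which mirrors the distinction between reads of a signal and all other accesses.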



A Brief Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of 
this specification, illustrate several embodiments of the invention and, together with the 
description, serve to explain the principles of the invention. 
Figure 1 shows an example of sequential loop processing.

Figure 2 shows an example of pipelined loop processing including the loop 
latency and initiation interval. 

Figure 3 shows an example of a loop carry dependency. 

Figure 4 shows an example of memory and I/O access restrictions in pipelined
loops.

Figure 5 is a block diagram showing a computer system. 
Figure 6 is a flowchart which shows steps in a circuit synthesis process. 
Figure 7 is a flowchart which shows steps for scheduling preprocessing. 
Figure 8 is a flowchart which shows steps for inserting constraints into a
constraint graph.

Figure 9 is a flowchart which shows steps for scheduling templates. 
Figure 10 is a flowchart which shows steps for creating a constraint using 
templates. 

Figure 11 shows HDL source code which contains a loop with a producer and a
consumer.

Figure 12 shows a circuit before scheduling which is created from loop 3030 of 
Figure 11. 

Figure 13 shows a constraint created for a producer and consumer in loop 3030. 
Figure 14 shows a circuit which is created after scheduling loop 3030 using an
initiation interval of 2 and a latency of 4.

Figure 15 shows Verilog HDL source code which contains a loop with I/O 
dependencies. 

Figure 16 shows a circuit before scheduling which is created from loop 1530 of 
Figure 15. 
Figure 17 shows a constraint created for two reads in loop 1530.

Figure 18 shows a circuit which is created after scheduling loop 1530 using an 
initiation interval of 2 and a latency of 4. 

Figure 19 (a) and Figure 19 (b) are examples of HDL source code including a 
delay clause. 

Figure 20 is a flowchart showing steps performed during translation from the 
source code of Figure 19 (a) and Figure 19 (b) to a circuit design that incorporates a 
delay specified by the delay clause. 

Figure 21 is a representation of a data flow graph generated from the source 
code of Figure 19 (a) and (b) in accordance with the steps of Figure 20. 

Figure 22 is a representation of a control flow graph generated from the source 
code of Figure 19 (a) and (b) and the data flow graph of Figure 21. 

Figure 23 is a flow chart showing steps performed to generate a control data 
flow graph from the control flow graph and data flow graph of Figure 21 and Figure 22. 

Figure 24 is a representation of a control data flow graph generated by the steps 
of Figure 23. 

Figure 25 is a diagram showing an example of loop tiling with and without the 
delay in the HDL. 

Figure 26 is a diagram showing the effect of the delay clause on pipelining. 
Figure 27 shows the operations of Figure 12 scheduled into control steps. 
Figure 28 shows the read operations of Figure 16 scheduled into control steps. 
Detailed Description of the Invention

The present invention is a method and apparatus for synthesizing a circuit which
implements a pipelined loop from a Hardware Description Language (HDL)
description. The following description is presented to enable any person skilled in the
art to make and use the invention, and is provided in the context of a particular
application and its requirements. Various modifications to the preferred embodiment
will be readily apparent to those skilled in the art, and the generic principles defined
herein may be applied to other embodiments and applications without departing from
the spirit and scope of the invention. Thus, the present invention is not intended to be
limited to the embodiment shown, but is to be accorded the widest scope consistent with
the principles and features disclosed herein.
1.0 Computer System Description 

Figure 5 illustrates a computer system 100 in accordance with a preferred
embodiment of the present invention. The computer system 100 includes a bus 101, or
other communications hardware and software, for communicating information, and a
processor 109, coupled with the bus 101, for processing information. The processor
109 can be a single processor or a number of individual processors that can work
together. The computer system 100 further includes a memory 104. The memory 104
can be random access memory (RAM), or some other dynamic storage device. The
memory 104 is coupled to the bus 101 and is for storing information and instructions to
be executed by the processor 109. The memory 104 also may be used for storing
temporary variables or other intermediate information during the execution of
instructions by the processor 109. The computer system 100 also includes a ROM 106
(read only memory), and/or some other static storage device, coupled to the bus 101.
The ROM 106 is for storing static information such as instructions or data.

The computer system 100 can optionally include a data storage device 107, such 
as a magnetic disk, a digital tape system, or an optical disk and a corresponding disk 
drive. The data storage device 107 can be coupled to the bus 101. 

The computer system 100 can also include a display device 121 for displaying
information to a user. The display device 121 can be coupled to the bus 101. The
display device 121 can include a frame buffer, specialized graphics rendering devices, a
cathode ray tube (CRT), and/or a flat panel display. The bus 101 can include a separate
bus for use by the display device 121 alone.

An input device 122, including alphanumeric and other keys, is typically
coupled to the bus 101 for communicating information, such as command selections, to
the processor 109 from a user. Another type of user input device is a cursor control 123,
such as a mouse, a trackball, a pen, a touch screen, a touch pad, a digital tablet, or
cursor direction keys, for communicating direction information to the processor 109,
and for controlling the cursor's movement on the display device 121. The cursor control
123 typically has two degrees of freedom, a first axis (e.g., x) and a second axis (e.g.,
y), which allows the cursor control 123 to specify positions in a plane. However, the
computer system 100 is not limited to input devices with only two degrees of freedom.

Another device which may be optionally coupled to the bus 101 is a hard copy
device 124 which may be used for printing instructions, data, or other information, on a
medium such as paper, film, slides, or other types of media.

A sound recording and/or playback device 125 can optionally be coupled to the
bus 101. For example, the sound recording and/or playback device 125 can include an
audio digitizer coupled to a microphone for recording sounds. Further, the sound
recording and/or playback device 125 may include speakers which are coupled to a
digital-to-analog (D/A) converter and an amplifier for playing back sounds.

A video input/output device 126 can optionally be coupled to the bus 101. The
video input/output device 126 can be used to digitize video images from, for example, a
television signal, a video cassette recorder, and/or a video camera. The video
input/output device 126 can include a scanner for scanning printed images. The video
input/output device 126 can generate a video signal for, for example, display by a
television.

Also, the computer system 100 can be part of a computer network (for example,
a LAN) using an optional network connector 127, being coupled to the bus 101. In one
embodiment of the invention, an entire network can then also be considered to be part
of the computer system 100.

An optional device 128 can optionally be coupled to the bus 101. The optional
device 128 can include, for example, a PCMCIA card and a PCMCIA adapter. The
optional device 128 can further include an optional device such as a modem or a wireless
network connection.

2.0 Definitions

A digital circuit is an interconnected collection of parts. Parts may also be called
cells. The digital circuit receives signals from external sources at points called primary
inputs. The digital circuit produces signals for external destinations at points called
primary outputs. Primary inputs and primary outputs are also called ports. Each part
receives input signals and computes output signals. Each part has one or more pins for
receiving input signals and producing output signals. In general, pins have a direction.
Most pins are either input pins, which are called loads, or output pins, which are called
drivers. Some pins may be bidirectional pins, which can be both drivers and loads.

Two or more pins from one or more parts or primary inputs or primary outputs
are connected together with a net. Each net establishes an electrical connection among
the connected pins, and allows the parts to interact electrically with each other. Pins are
also connected to primary inputs and primary outputs with nets. For the sake of
simplicity, parts may be said to be "connected" to nets, but it is actually pins on the
parts which are connected to the nets.

A Circuit Element is any component of a circuit. Ports, pins, nets, and cells are
all circuit elements. Any circuit element which is an input to another circuit element is
said to drive that circuit element. Any circuit element which is an output of another
circuit element is said to load that circuit element. For example, drivers drive a signal
onto a net; loads load nets with capacitance.

A digital circuit design can be stored in memory of a computer system using
data structures which represent the various components of the circuit. The data
structures have the same name as the physical components. In this document, parts,
cells, nets, pins, and other digital circuit components refer to the software representation
of the physical digital circuit component.

A digital circuit can be specified hierarchically. Some or all of the parts in the 
digital circuit may themselves be digital circuits composed of more interconnected 
parts. When a high level part is specified as a digital circuit composed of other, lower 
level parts, the pins of the high level part become the primary inputs and primary 
outputs for the digital circuit comprising the lower level parts. When a high level part is
composed of lower level parts, it is called a level of hierarchy. 

Following are additional definitions of terms which are used in this document. 

An HDL is a Hardware Description Language. HDLs are used to describe
designs for digital circuits. 

A Translated Circuit, Generic Technology Circuit, or GTech Circuit is a 
software representation of a digital circuit which does not include references to a 
specific technology, but rather refers to cells that implement generic logic such as 
"and", "or", and "not". This software representation is stored in memory 104 of 
computer system 100. 

A Mapped Circuit is a software representation of a digital circuit which is built
from parts available in a technology library which is provided by a silicon vendor. This
software representation is stored in memory 104 of computer system 100. A mapped
circuit can be timed using a conventional timing verifier such as DesignTime, available
from Synopsys, Inc. in Mountain View, Calif. After it is built, a netlist representation of
a mapped circuit can be sent to a silicon vendor for layout and fabrication. For instance,
the mapped circuit can be written out using LSI netlist format and sent to LSI Logic in
Milpitas, Calif. The process of creating a mapped circuit from a generic technology
circuit is called mapping. Because a circuit must be mapped before it can be timed,
mapped circuits are also used internally by synthesis tools.

The Fanout of a circuit element includes any circuit elements which are driven
by that circuit element. The transitive fanout of a circuit element includes all of the
circuit elements in the circuit which are driven, either directly or indirectly, by that
circuit element. Thus, the transitive fanout of a circuit element includes the fanout of
that circuit element, as well as the fanout of each of the circuit elements in the original
fanout, and so on.

The Fanin of a circuit element includes any circuit elements which drive that 
circuit element. The transitive fanin of a circuit element includes all of the circuit
elements in the circuit which drive, either directly or indirectly, that circuit element. 
Thus, the transitive fanin of a circuit element includes the fanin of that circuit element, 
as well as the fanin of each of the circuit elements in the original fanin, and so on. 
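
By way of illustration only, the fanout and fanin definitions above can be sketched as a reachability computation over a driver-to-load map. The element names and the dictionary layout below are hypothetical and are not taken from the described embodiment.

```python
from collections import defaultdict

def transitive_closure(start, neighbors):
    """Return every element reachable from `start` by following `neighbors`."""
    seen = set()
    stack = list(neighbors.get(start, ()))
    while stack:
        elem = stack.pop()
        if elem not in seen:
            seen.add(elem)
            stack.extend(neighbors.get(elem, ()))
    return seen

# Hypothetical circuit: port a -> net n1 -> cell g1 -> net n2 -> cell g2 -> port out.
# fanout[e] lists the circuit elements directly driven by e.
fanout = {"a": ["n1"], "n1": ["g1"], "g1": ["n2"], "n2": ["g2"], "g2": ["out"]}

# fanin[e] lists the circuit elements which directly drive e (the reverse map).
fanin = defaultdict(list)
for driver, loads in fanout.items():
    for load in loads:
        fanin[load].append(driver)

def transitive_fanout(elem):
    return transitive_closure(elem, fanout)

def transitive_fanin(elem):
    return transitive_closure(elem, fanin)
```

In this sketch, the transitive fanout of g1 is the set containing n2, g2, and out, and its transitive fanin is the set containing n1 and a.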

An Operator is a function, such as addition. Such functions are used in HDL 
source code. For example, the plus in "c=a+b;" is an operator. 

An Operation is a software representation of a hardware functional unit which 
performs a function such as addition. For example, a software representation of an 
adder is an operation. 

A Clock Cycle is a period of time, for example 10ns, between pulses of a 
clocking element in a digital circuit. The clocking element is used to synchronize the 
digital circuit. 
3.0 Scheduling 

Scheduling is a well defined problem which has been studied extensively. An 
overview of the scheduling problem is available in The High-Level Synthesis of Digital
Systems by Michael McFarland, Alice Parker, and Raul Camposano, in Proceedings of 
the IEEE, February 1990, which is hereby incorporated by reference. 

The input to a scheduler is typically a set of hardware operations, a set of 
constraints between the hardware operations, a clock period, and a set of control steps 
into which the hardware operations must be mapped. The output is a schedule where 
each hardware operation is mapped to a control step. 

Schedulers typically use a number of graphs. For instance, the constraints for a
scheduler are often represented using a graph. Nodes in the graph typically represent
events to be scheduled, such as operations, and edges in the graph represent constraints
between the events. The scheduler checks the constraint graph to ensure that all of the
constraints are met before placing an event into a particular control step. Schedulers
also use control graphs, data flow graphs, and combination control data flow graphs
(CDFGs). Control graphs represent the flow of control in a circuit. Data flow graphs
represent the flow of data in a circuit; that is, the flow of data from the inputs to the
outputs of the circuit. Control data flow graphs combine both control flow and data flow
information into a single graph. All of these types of graphs are described in High-Level
Synthesis (subtitled Introduction to Chip and System Design) by Daniel Gajski, Nikil
Dutt, Allen C-H Wu, and Steve Y-L Lin, Kluwer Academic Publishers, 1992, which is
hereby incorporated by reference and will subsequently be referred to as High-Level
Synthesis by Gajski et al.

An additional technique used for scheduling circuits involves "templates".
Templates are described in Scheduling using Behavioral Templates by Tai Ly, David
Knapp, Ron Miller, and Don MacMillen in Proceedings of the 31st DAC, June 1995,
which is included as Appendix A and is hereby incorporated by reference. Simply
speaking, templates are data structures which specify scheduling constraints among
CDFG nodes. Templates "lock" the control step relationship between 2 or more CDFG
nodes. Figure 13 shows an example of two templates, template 1250 and template 1280.
Each template contains one or more nodes, some of which may represent operations.
For example, adder node 2020 represents adder 3120 of Figure 12.

3.1 Overview of Synthesis with Scheduling

Figure 6 is a flowchart showing how scheduling steps fit into the overall 
synthesis strategy. This flowchart shows how a mapped circuit is created from a source 
HDL description. The input to synthesis is an HDL description of a digital circuit. Such 
a description may be written in VHDL, Verilog, or some other HDL. 

An HDL description is translated in step 810 to generic logic. A conventional
HDL translator 1310 such as VHDL Compiler version 3.2b from Synopsys, Inc. in
Mountain View, Calif., preferably is used.

Step 820 performs scheduling preprocessing steps. These steps are shown in
Figure 7 and Figure 8.

Step 830 schedules the operations in the circuit. A method for scheduling the
operations in the circuit is shown in Figure 9.

Step 840 netlists the scheduled circuit. Netlisting creates a GTech circuit from
the scheduled CDFG. The CDFG representation of the circuit in memory is transformed
into a GTech representation of the circuit in memory.

In step 850, the resulting GTech circuit is optimized using conventional logic
synthesis such as Design Compiler version 3.2b by Synopsys, Inc. in Mountain View,
Calif. The output of logic optimization is a mapped circuit description which can be
sent to a silicon vendor for fabrication. For example, a description of the mapped circuit
can be output using LSI Netlist format and sent to LSI Logic in Milpitas, Calif. for
fabrication.

3.2 Scheduling Preprocessing 

Figure 7 is a flowchart which shows steps for scheduling preprocessing. The
input to the method is an annotated GTech circuit. Annotations on the circuit include
delayed signal assignment information. The use of delayed signal assignments will be
discussed in a later section.

Step 910 extracts a control graph from the annotated GTech using conventional 
techniques. In addition, information concerning delayed signal assignments is extracted 
as described below. 

Step 920 extracts a Control Data Flow Graph (CDFG) from the control graph
created in step 910 and the data flow graph represented by the GTech circuit. This is
also done using conventional techniques.

Step 930 creates initial templates for the operations in the CDFG as described in
Scheduling Using Behavioral Templates in Appendix A. These initial templates form
the initial constraint graph.

Step 940 inserts constraints in the constraint graph. Some types of constraints 
are discussed in Scheduling Using Behavioral Templates in Appendix A. Other types of 
constraints are a part of the present invention and will be discussed in subsequent 
sections.

3.3 Inserting Constraints

Step 940 of Figure 7 is implemented by the process of Figure 8, which is a flowchart
showing steps for inserting constraints into a constraint graph which uses templates. The
input to the process is a CDFG and a constraint graph.

Step 1110 identifies Loop Carry Dependency (LCD) producer consumer pairs.
LCD's are identified by tracing the CDFG using conventional techniques. LCD's are
discussed below in connection with Figure 11, Figure 12, Figure 13, Figure 14, and
Figure 27.

Step 1120 constrains the LCD's. Constraining LCD's involves adding constraints
to the constraint graph so that producer and consumer operations are scheduled so that
the consumer consumes a value produced by the producer before it is overwritten in a
subsequent iteration of the loop. A method and apparatus for constraining LCD's will be
discussed in a later section.

Step 1130 identifies memory and I/O access dependencies in loops which will
be scheduled using pipelines. I/O accesses include reads and writes to memories,
signals, and ports. Reads and writes in one iteration of the loop may not "cross," or
occur after, reads and writes in subsequent iterations of the loop. Specifically, all reads
and writes to the same memory, and all reads and writes to the same signal or port in
one iteration of the loop must occur before any reads or writes to the same memory,
signal or port in a subsequent iteration of the loop. The one exception to this rule is that
reads of the same signal or port may occur simultaneously with a read of the same signal
or port in a subsequent iteration of the loop. This step finds the first and last accesses for
each memory, signal, or port by tracing through the CDFG using conventional
techniques. Memory and I/O accesses are discussed below in connection with Figure
15, Figure 16, Figure 17, Figure 18, and Figure 28.

Step 1140 constrains the memory and I/O accesses in pipelined loops.
Constraining memory and I/O accesses involves adding constraints to the constraint
graph so that first and last accesses are scheduled so that the last access occurs before
the first access in a subsequent iteration of the loop. A method and apparatus for
constraining memory and I/O accesses will be discussed in a later section.

Step 1150 inserts other types of constraints into the constraint graph. Such
constraints are discussed in Scheduling Using Behavioral Templates in Appendix A. An
example of another type of constraint is a dataflow constraint, which ensures that data
values are produced before they are consumed by subsequent operations.
3.4 Scheduling Templates 

Figure 9 is a flowchart which shows steps of scheduling (step 830 of Figure 6)
using templates. The input to the process is the CDFG and the constraint graph created
by the steps of Figure 8. It is possible to schedule templates using many different
scheduling techniques. A number of scheduling techniques are described in High-Level
Synthesis by Gajski et al., particularly in Chapter 7. This figure shows a general method,
which is provided as an example.

Step 1010 creates the As Soon As Possible (ASAP) and As Late As Possible
(ALAP) schedules for each template while satisfying the constraints represented in the
constraint graph. The ASAP schedule places each template into the earliest possible
control step (c-step). The ALAP schedule places each template into the latest possible
control step. Together, the earliest and latest control steps define a range into which
each template may be scheduled. A method for determining the ASAP and ALAP
schedules for templates is described in Scheduling Using Behavioral Templates in
Appendix A.
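
The notion of an ASAP/ALAP range can be sketched as follows. This is a simplified illustration that schedules individual nodes under minimum-distance precedence constraints; it is not the template-based method of Appendix A, and the node names, the edge encoding, and the function name asap_alap are assumptions made for the example.

```python
def asap_alap(nodes, edges, num_steps):
    """Compute earliest and latest control steps for each node.

    nodes: node names listed in topological order (an assumption of this sketch).
    edges: (src, dst, d) tuples meaning dst is at least d control steps after src.
    num_steps: number of available control steps, numbered 0 .. num_steps - 1.
    """
    # ASAP: push every node as early as its predecessors allow.
    asap = {n: 0 for n in nodes}
    for n in nodes:
        for src, dst, d in edges:
            if src == n:
                asap[dst] = max(asap[dst], asap[n] + d)

    # ALAP: pull every node as late as its successors allow.
    alap = {n: num_steps - 1 for n in nodes}
    for n in reversed(nodes):
        for src, dst, d in edges:
            if dst == n:
                alap[src] = min(alap[src], alap[n] - d)
    return asap, alap

# Hypothetical data flow: read feeds sub and add; sub feeds write.
nodes = ["read", "sub", "add", "write"]
edges = [("read", "sub", 1), ("sub", "write", 1), ("read", "add", 1)]
asap, alap = asap_alap(nodes, edges, num_steps=4)
ranges = {n: (asap[n], alap[n]) for n in nodes}
```

With four control steps, the resulting ranges are (0, 1) for read, (1, 2) for sub, (1, 3) for add, and (2, 3) for write. A scheduler may then place each node anywhere within its range, subject to the remaining constraints.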

Loop 1020 loops until a "good" schedule is found. A "good" schedule is one
which fulfills the constraints specified in the constraint graph and optimizes for a
specific goal specified by a human designer, such as fewest number of control steps.
Different scheduling techniques use different criteria for deciding when to stop trying to
improve the schedule. For example, one technique might stop when the constraints are
all met, or when a certain amount of CPU time has been spent, whichever comes last.

Step 1030 picks a template in the constraint graph to schedule. Different
techniques use different criteria for deciding what to schedule next. Generally, template
scheduling techniques use criteria based upon the operations in a template. For instance,
a list scheduling technique which uses priorities will assign a priority to a template
based on the priorities of the operations within the template. (List scheduling is
described in High-Level Synthesis by Gajski et al. in Chapter 7.)

Step 1040 schedules the chosen template in the control step chosen by the 
scheduling technique being used. Templates are scheduled by placing the first operation 
within the template into the chosen control step and the remaining operations within the 
template into subsequent control steps as defined by the template. 

Arrow 1050 indicates that loop 1020 iterates until a "good" schedule is found. 

4.0 Method for Creating Constraints 

This section describes a general technique for constraining the relationship 
between two nodes in a constraint graph. Such constraints are added in step 940 of 
Figure 7. The section then describes examples of using this technique to constrain loop 
carry dependencies and I/O dependencies. 

4.1 Placeholder Node Method 

Figure 10 shows a general method for creating a scheduling constraint between
two nodes in a constraint graph. Such constraints are created in step 1120 and step 1140
of Figure 8 to constrain LCD's and memory and I/O accesses. This section shows a
general method and discusses specific examples. The first example constrains an LCD;
the second example constrains a pair of signal reads. The input to the process of Figure
10 is a constraint graph, two templates in the graph, Event 1 and Event 2, an integer n,
and a number of cycles c. "n" is the number of cycles within which Event 2 must be
scheduled after Event 1. "c" is either 0 or 1. "c" has value 1 when Event 2 must be
scheduled before n cycles after Event 1, and value 0 when Event 2 may be scheduled
exactly n cycles after Event 1.

Step 610 adds a placeholder node H to the template for Event 1 in the constraint
graph. A placeholder node is a node in the constraint graph which is only used to create
constraints. The placeholder node does not represent any portion of the final circuit.
Placeholder node H is inserted into Event 1's template such that it is locked n cycles
after Event 1.

Step 620 adds a constraint in the constraint graph from Event 2 to placeholder
node H which constrains Event 2 to occur at least c cycles before placeholder node H,
where c is 0 or 1. The value of c depends on the constraint being added and will be
discussed in subsequent sections.
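
Steps 610 and 620 can be sketched in software as shown below. This is a minimal sketch under stated assumptions: a template is modeled as a map from node name to a fixed cycle offset, a constraint is modeled as a minimum gap between two nodes, and the class name ConstraintGraph, its methods, and the template names T_sub and T_add are hypothetical rather than part of the described embodiment.

```python
from dataclasses import dataclass, field

@dataclass
class ConstraintGraph:
    # template name -> {node name: fixed offset in cycles from the template start}
    templates: dict = field(default_factory=dict)
    # (earlier, later, min_gap): `earlier` is at least `min_gap` cycles before `later`
    constraints: list = field(default_factory=list)

    def add_placeholder_constraint(self, template, event1, event2, n, c, name="H"):
        # Step 610: lock placeholder H n cycles after Event 1 in Event 1's template.
        offsets = self.templates[template]
        offsets[name] = offsets[event1] + n
        # Step 620: Event 2 must occur at least c cycles before H.
        self.constraints.append((event2, name, c))
        return name

    def node_steps(self, template_starts):
        # Control step of every node, given the start step chosen for each template.
        return {node: template_starts[t] + off
                for t, offsets in self.templates.items()
                for node, off in offsets.items()}

    def satisfied(self, template_starts):
        steps = self.node_steps(template_starts)
        return all(steps[later] - steps[earlier] >= gap
                   for earlier, later, gap in self.constraints)

# Illustrative values only: Event 1 is "sub", Event 2 is "add", n = 2, c = 1.
g = ConstraintGraph(templates={"T_sub": {"sub": 0}, "T_add": {"add": 0}})
g.add_placeholder_constraint("T_sub", event1="sub", event2="add", n=2, c=1)
ok = g.satisfied({"T_sub": 0, "T_add": 1})        # add is 1 cycle before H
too_late = g.satisfied({"T_sub": 0, "T_add": 2})  # add is in the same cycle as H
```

Because the placeholder lives inside Event 1's template, it moves with Event 1 wherever the template is scheduled, so a single ordinary constraint to H bounds how late Event 2 may occur relative to Event 1.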

4.2 Using Placeholder Nodes for Loop Carry Dependencies 

The following section provides an example of constraining loop carry 
dependencies using placeholder nodes. Such constraints are created in step 1120 of
Figure 8. A loop carry dependency is a data value which is produced in one iteration of 
a loop and consumed by operations in subsequent iterations of the loop. To use the 
placeholder node method to schedule loop carry dependencies, Event 1 is set to be the 
operation which consumes the data. Event 2 is set to be the operation which produces 
that data. Event 2 must be scheduled so that the correct data values are driving it when it 
feeds its outputs to Event 1. If the consumer (Event 1) consumes the data one iteration 
after the producer (Event 2) creates it, then n is set to be the initiation interval of the 
loop. If the consumer consumes the data k iterations after it is created by the producer, 
then n is set to be k * initiation interval. For LCD's, "c" has value "1" because the 
producer must be scheduled before the consumer in the subsequent iteration of the loop. 
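
The choice of n and c for a loop carry dependency reduces to a small computation, sketched below. The function names and the step-number encoding are assumptions made for illustration; the check simply restates the placeholder constraint with the consumer as Event 1 and the producer as Event 2.

```python
def lcd_placeholder_params(initiation_interval, k=1):
    """Return (n, c) for a value consumed k iterations after it is produced."""
    if initiation_interval < 1 or k < 1:
        raise ValueError("initiation_interval and k must be positive")
    n = k * initiation_interval  # distance from the consumer to placeholder H
    c = 1                        # producer must be strictly before H
    return n, c

def lcd_respected(producer_step, consumer_step, initiation_interval, k=1):
    """Check the placeholder constraint for steps given within one iteration."""
    n, c = lcd_placeholder_params(initiation_interval, k)
    placeholder_step = consumer_step + n  # H is locked n cycles after the consumer
    return placeholder_step - producer_step >= c
```

For example, with an initiation interval of 2 and a consumer in control step 0, a producer in control step 1 satisfies the constraint, while a producer in control step 2 does not.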

Figure 11 shows an example of Verilog source code for a loop 3030 with a loop
carry dependency between addition 3020 and subtraction 3010. The output of addition
3020, p, drives the input of subtraction 3010 on the next iteration of the loop. "p" is a
Loop Carry Dependency. In this example, a human designer has specified that loop
3030 will be scheduled using an initiation interval of 2 and a latency of 4. Although this 
loop would not usually be pipelined because pipelining does not increase its throughput, 
this simple example is used for the sake of clarity. 

Figure 12 shows a GTech circuit representation 2000 which is created for loop
3030 in Figure 11. The GTech circuit representation is stored in memory 104. GTech
circuit 2000 is output from step 810 of Figure 6. Addition 3020 is implemented as adder
3120, and subtraction 3010 is implemented as subtracter 3110. Port p 2040 drives
subtracter 3110. Port p' 2045 is driven by adder 3120. Port p 2040 and port p' 2045 are
partner ports. Partner ports are ports which represent the same signal, and thus
frequently embody loop carry dependencies. Partner ports contain references to their
partners. In the described embodiment, these references are implemented as pointers.
Each port which has a partner contains a pointer to its partner port.

Figure 13 shows a constraint 1270 between adder node 2020, which is the
producer for this LCD, and subtracter node 2010 which is the consumer of this LCD. 
The consumer and producer were identified in step 1110 of Figure 8. This constraint is 
created using the method of Figure 10. The starting templates are shown in Figure 
13(a). First step 610 of Figure 10 adds placeholder node H 2060 to the template 1250 of 
subtracter node 2010. Because the initiation interval for the loop is 2, placeholder node 
H 2060 is constrained to be 2 cycles after subtracter node 2010 by template 1250. Next, 
step 620 creates constraint 1270, represented by an arrow, which constrains adder node 
2020 to be at least one cycle before placeholder node H 2060. The modified templates 
and the new constraint are shown in Figure 13(b). The new constraint is then used to
schedule the loop correctly using a method such as the one shown in Figure 9. 

Figure 27 shows the add and subtract operations of Figure 12 scheduled into 
control steps by step 830 of Figure 6. For the sake of clarity, the other operations in the 
circuit are not shown. Two iterations of the loop are shown, to demonstrate how the 
schedule properly handles the loop carry dependency. Adder 3120 is scheduled so that 
its result is available before subtracter 3110 needs it in the next iteration of the loop.

Figure 14 shows the circuit created from the Verilog HDL source code of Figure
11 after scheduling. Block 3190 represents the representation of the FSM controller for
this circuit stored in memory 104.

4.3 Using Placeholder Nodes for I/O Dependencies

Loop pipelining must preserve the original order of all reads and writes to the
same memory, signal, or port. The placeholder node method can be used to create
constraints which ensure that I/O accesses in different iterations of the loop do not cross
one another. Such constraints are created in step 1140 of Figure 8. The last I/O access to
the same memory, signal, or port in a loop must occur simultaneously with or before the
first I/O access to that memory, signal or port in the next iteration of the loop.
Specifically, reads of the same signal or port may occur simultaneously with reads in
the next iteration of the loop, but not after. Writes to the same signal or port must occur
before any read or write to the same signal or port in the next iteration of the loop.
Reads and writes to the same memory must occur before any read or write to the same
memory in the next iteration of the loop.

Thus, any last I/O access must occur within the initiation interval of the first I/O
or memory access. To create this constraint, Event 1 of Figure 10 is set to be the first
I/O access to a given memory, signal or port. Event 2 of Figure 10 is set to be the last
I/O access to a given memory, signal or port. n is set to be the initiation interval of the
loop, and c is set to be 0 or 1. Specifically, c is set to be 0 if Event 1 and Event 2 are
signal or port reads. c is set to be 1 if Event 1 or Event 2 are signal or port writes, or
memory reads or writes.
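
The selection of n and c for an I/O dependency can be sketched as a small function. The Access record and the function name io_placeholder_params are hypothetical and serve only to restate the rule above in executable form.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Access:
    target_kind: str  # "signal", "port", or "memory"
    is_write: bool

def io_placeholder_params(first, last, initiation_interval):
    """Return (n, c) for the first (Event 1) and last (Event 2) access in a loop."""
    def is_signal_or_port_read(access):
        return access.target_kind in ("signal", "port") and not access.is_write

    # c = 0 only when both accesses are signal or port reads, which may coincide.
    # Any write, or any memory access, requires strict ordering (c = 1).
    both_reads = is_signal_or_port_read(first) and is_signal_or_port_read(last)
    c = 0 if both_reads else 1
    n = initiation_interval
    return n, c
```

For two reads of the same signal in a loop with an initiation interval of 1, this sketch yields n = 1 and c = 0. If either access is a write, or the target is a memory, it yields c = 1.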

Figure 15 shows an example of Verilog source code for a loop 1530 with an I/O
dependency between read 1510 and read 1520. Both read 1510 and read 1520 read the
value of the same signal, x. Thus, read 1520 must be scheduled such that it occurs
no later than read 1510 in the next iteration of the loop. In this example, a human
designer has specified that this loop 1530 will be scheduled using an initiation interval
of 1 and a latency of 3.

Figure 16 shows the GTech circuit 1500 which is created for loop 1530 of
Figure 15. Circuit 1500 is output from step 810 of Figure 6. Read 1510 is implemented
by read operation 3130. Read 1520 is implemented by read operation 3140. In this
example, a human designer has specified that this loop will be pipelined with an
initiation interval of 1 and a latency of 3.

Figure 17 shows a constraint between read node 1610, the first read of x in loop
1530, and read node 1620, the last read of x in loop 1530. Read node 1610 and read
node 1620 were identified in step 1130 of Figure 8. This constraint is created using the
method of Figure 10. First step 610 adds placeholder node H 1760 to the template 1750
of read node 1610. Placeholder node H is constrained to be 1 cycle after read node
1610, because the initiation interval is 1, by template 1650. Next, step 620 creates
constraint 1770, represented by an arrow, which constrains read node 1620 to be at least
0 cycles before, that is in the same cycle as or before, placeholder node H 1760. Read node
1620 is constrained to be 0 cycles before placeholder node H 1760 because read node
1620 and read node 1610 are both signal reads, and as such are allowed to occur in the
same control step. Constraint 1770 is then used to schedule the loop correctly using a
method such as the one shown in Figure 9.

Figure 28 shows read operations on signal x of Figure 16 scheduled into control
steps by step 830 of Figure 6. For the sake of clarity, the other operations in the circuit
are not shown. Two iterations of the loop are shown, to demonstrate how the schedule
properly handles the multiple signal reads. Read 3130 is scheduled so that it occurs
simultaneously with read 3140 in the next iteration of the loop. Since simultaneous
signal reads are allowed, this is a legal schedule.

Figure 18 shows the circuit created from the Verilog HDL source code of Figure
15 after scheduling.

5.0 Circuit Synthesis using Delayed Signal Assignment Information

Conventional design methodology uses a simulator to verify the correctness of a
design both before and after it is synthesized. Conventional simulation systems,
especially those systems performing behavioral synthesis, do not always yield identical
cycle timing characteristics when HDL source code is simulated and when a synthesis
output (a representation of a synthesized circuit) is simulated. It is advantageous for
behavioral synthesis to be able to infer a circuit which will have the same cycle by cycle
behavior during simulation as the simulation of the source HDL.

The source code of Figure 19(a) is written in the Verilog circuit specification
language. The source code of Figure 19(b) is written in the VHDL circuit specification
language. Both Verilog and VHDL are Hardware Description Languages (HDLs).

In Figure 19(a), the Verilog source code includes a signal assignment statement: 
c <= #24 x - p;

This statement includes a delay clause ("#24") indicating that a delay of twenty-four
time units, e.g., nanoseconds, should pass before the write operation is performed
by the circuit that is to be generated. The delay clause is an example of delayed signal
assignment information. Note that the inclusion of the delay clause in the HDL indicates
a delay of the write operation only. The delay clause does not cause a delay in the
performance of the subtraction operation. Similarly, in Figure 19(b), the VHDL source
code includes a signal assignment statement:
c <= transport x - p after 24 ns;

This statement also contains a delay clause ("after 24 ns") indicating that a delay
of twenty-four time units should occur in the generated circuit before the write
operation is performed. This delay clause is a further example of delayed signal
assignment information.

A circuit loop generated from the HDL source code of Figure 19(a) and Figure
19(b) will have an initiation interval of "2" because each source code example has two
"wait" (or "posedge" or "negedge") statements within the loop. As discussed below, the
delay clause in the source code causes the resulting loop to have a loop latency of "4".
Figure 19(a) and Figure 19(b) are included for the purpose of example only. The present
invention can use any appropriate type of source code (VHDL, Verilog, etc.) to
represent a delay clause.

Figure 20 is a flowchart showing steps performed during translation step 810 of
Figure 6 to generate a cdb. The exact placement of the steps of Figure 20 is not a part
of the present invention and the steps also can be performed, for example, in the
preprocessing step 820 of Figure 6. The input to Figure 20 is a representation of one of
the source code examples of Figure 19(a) and Figure 19(b), such as a parse tree
generated from the source code. The steps of Figure 20 are performed for each
statement in the source code. The output of the translation step 810 and Figure 20 is a
data flow graph (a "GTech circuit") and a control flow graph (a "control data base"
(cdb)). It will be understood by persons of ordinary skill in the art that the steps of
Figure 20 and Figure 23 are performed by processor 109 of Figure 5, performing
instructions stored in memory 104 of Figure 5.

In Step 2002, the processor determines whether the current source code
statement is a signal assignment statement (e.g., an assignment to a port using the "<="
operator) that includes a delay clause (e.g., "#24" in Verilog or "after 24 ns" in VHDL).
If not, in step 2006, the processor performs standard processing for the node to build a
node in the data flow graph. If the current source code statement includes a delay
clause, then, in step 2004, the processor builds a write operation node in the data flow
graph and annotates the node by adding an attribute indicating delayed signal
assignment information to show that the write operation corresponding to the write
operation node has a delay of, e.g., 24 nanoseconds (see node 2114 of Figure 21 and
Figure 22).
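
The per-statement decision described above can be sketched as follows. This is a toy illustration only: it matches statement text with regular expressions, whereas the described embodiment operates on a representation such as a parse tree, and the node dictionaries, the pattern names, and the function build_node are assumptions made for the example.

```python
import re

# Verilog non-blocking assignment with an intra-assignment delay: c <= #24 x - p;
VERILOG_DELAY = re.compile(r"^\s*(\w+)\s*<=\s*#\s*(\d+)\s*(.+?);\s*$")
# VHDL signal assignment with an "after" clause: c <= transport x - p after 24 ns;
VHDL_DELAY = re.compile(
    r"^\s*(\w+)\s*<=\s*(?:transport\s+)?(.+?)\s+after\s+(\d+)\s*ns\s*;\s*$",
    re.IGNORECASE,
)

def build_node(statement):
    """Build a data flow node, annotating delayed signal assignments."""
    m = VERILOG_DELAY.match(statement)
    if m:
        # Write operation node carrying the delay as an attribute.
        return {"op": "write", "target": m.group(1),
                "expr": m.group(3).strip(), "delay": int(m.group(2))}
    m = VHDL_DELAY.match(statement)
    if m:
        return {"op": "write", "target": m.group(1),
                "expr": m.group(2).strip(), "delay": int(m.group(3))}
    # No delay clause: standard processing (represented here by a plain node).
    return {"op": "standard", "source": statement.strip()}

verilog_node = build_node("c <= #24 x - p;")
vhdl_node = build_node("c <= transport x - p after 24 ns;")
plain_node = build_node("p = y + z;")
```

Both delayed assignments produce a write node whose delay attribute is 24, while the assignment without a delay clause falls through to the standard path. Only the write is annotated; the subtraction expression itself carries no delay.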

Figure 21 shows an example of a data flow graph 2100 generated from one of
the source code examples of Figure 19(a) and Figure 19(b) in accordance with the steps
of Figure 20. A representation of data flow graph 2100 is stored in memory 104. Data
flow graph 2100 includes as inputs a port x, a register p, and ports y and z. Each port
has zero or more read operation nodes ("read op") 2102, 2104, 2106 associated
therewith and each read operation node has an attribute indicating a port name (e.g.,
"port='x'"). Respective ones of the inputs are input to a subtracter node 2110 and an
adder node 2112. Subtracter node 2110 is connected to a write operation node 2114.
Adder node 2112 is connected to a variable assignment node 2116. Output p' is input as
p during successive iterations of the loop. Thus, the data flow graph of Figure 21 has
seven nodes representing the data flow in the circuit to be synthesized.

In step 2008 of Figure 20, if there are more statements in the source code,
control returns to step 2002. If all statements have been processed and a data flow graph
(including signal delay attributes) has been generated for the source code, control passes
to step 2012, where a control flow graph, such as that in Figure 22, is created.

Control graph 2200 of Figure 22 adds control information to nodes 2102, 2104,
2106, 2110, 2112, 2114, and 2116 indicating the order and conditions under which the
data flow nodes are executed in the synthesized circuit. A representation of control
graph 2200 is stored in memory 104 of Figure 5. The present invention preferably
operates in a "cycle fixed mode" in which each "wait" (or "posedge" or "negedge")
statement in the source code indicates a new cycle in the synthesized circuit. Various
processes for generating control flow graphs are known to persons of ordinary skill in
the art and are described in High-Level Synthesis by Gajski et al.

In Figure 22, cnodes are used as "placeholder" nodes in the control graph to
represent a collection of data flow nodes. Thus, cnode 2200 is associated with write
operation node 2114 (including the signal delay attribute), read operation node 2102,
and subtracter node 2110. The wait nodes in Figure 22 are used to represent the
transitions between each cycle (or "cstep"). A wait node 2204 is used to mark the
transition between the first cstep (cstep 0) and the second cstep (cstep 1). Wait node
2204 also has attributes indicating that it is based on a rising clock edge (due to the
"posedge" statement in the source code). "Wait statements" (in VHDL source code) are
treated similarly. Cnode 2206 (located in the second cstep) is associated with variable
assignment node 2116, read operation node 2104, read operation node 2106, and adder
node 2112. The control graph also includes a second wait node 2208 and a third cnode
2210.

As shown in Figure 7, the control flow graph is input to step 920, where a
control data flow graph (CDFG) is created. The general procedure for creating a
conventional CDFG is known to persons of ordinary skill in the art and is described in
High-Level Synthesis by Gajski et al. Figure 23 shows certain details of the process of
creating a CDFG that relate to the delay clause of the present invention. An example
CDFG is shown in Figure 24. The steps of Figure 23 are performed for each loop in the
control flow graph. In step 2302, the processor sets a wait_count variable and a
Max_wait_count variable in the memory 104 to an initial value of "0". In
step 2304 the processor builds a "loop begin" node in the CDFG and assigns to it a cstep
attribute value equal to "0".

Step 2306 is a first step in a loop performed by the processor for each cdb node.
In step 2308, if the current cdb node is a cnode, control passes to step 2310, which is a
first step in a loop performed for all data flow nodes associated with the current cdb
node. In step 2312, if a current data flow node is a write operation node having a delay
clause (i.e., if the current data flow node represents a delayed signal assignment),
control passes to step 2322.

In step 2322, a temp_wait_count variable is set to the current value
of wait_count plus the number of delay time units in the delayed signal assignment
divided by the clock period (e.g., 0+24/6=4). A CDFG node is created and assigned to
cstep temp_wait_count in step 2324. In step 2326, if temp_wait_count
is greater than Max_wait_count, then in step 2328, Max_wait_count
is set equal to temp_wait_count. Otherwise, control passes
to step 2342. If, in step 2342, there are more data flow nodes associated with the current
cdb node, then control passes to step 2310. Otherwise control passes to step 2336.

If, in step 2312, the current data flow node is not a delayed signal assignment, the
processor builds a standard CDFG node in step 2314 and assigns the created data flow
node to cstep wait_count in step 2316. If, in step 2318, wait_count is greater
than Max_wait_count, then Max_wait_count is set equal to
wait_count in step 2320. Control next passes to step 2342.

If, in step 2308, the current cdb node is not a cnode, then control passes to step
2330. If in step 2330 the current cdb node is a wait node, then wait_count is
incremented in step 2332 and control passes to step 2336. If, in step 2330, the current
cdb node is not a wait node, then regular processing is performed to create a CDFG
node in step 2334 and control passes to step 2336.

In step 2336, if there are more cdb nodes to process, then control passes to step
2306. Otherwise, in step 2338, a loop_latency variable in memory 104 for the loop is set
equal to Max_wait_count and an initiation interval variable for the loop is
set equal to wait_count. In step 2340, the processor builds a "loop
end" node in the CDFG and assigns it to cstep wait_count.
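The steps of Figure 23 described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the class names (DataFlowNode, CNode, WaitNode), the function name build_loop_cdfg, and the use of integer division for the delay-to-cycle conversion are all assumptions made for the example, and the "regular processing" of step 2334 is omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataFlowNode:
    name: str
    delay: int = 0  # delay clause in time units; 0 means no delay clause


@dataclass
class CNode:
    nodes: List[DataFlowNode] = field(default_factory=list)


@dataclass
class WaitNode:
    pass


def build_loop_cdfg(control_nodes, clock_period):
    """Assign CDFG nodes of one loop to csteps (hypothetical sketch of Figure 23)."""
    wait_count = 0                       # step 2302
    max_wait_count = 0                   # step 2302
    cdfg = [("loop_begin", 0)]           # step 2304
    for cdb_node in control_nodes:       # step 2306
        if isinstance(cdb_node, CNode):  # step 2308
            for node in cdb_node.nodes:  # step 2310
                if node.delay > 0:       # step 2312: delayed signal assignment
                    # step 2322; integer division is an assumption
                    cstep = wait_count + node.delay // clock_period
                else:
                    cstep = wait_count   # steps 2314 and 2316
                cdfg.append((node.name, cstep))
                # steps 2326/2328 and 2318/2320
                max_wait_count = max(max_wait_count, cstep)
        elif isinstance(cdb_node, WaitNode):  # step 2330
            wait_count += 1                   # step 2332
        # other node kinds (step 2334) are not modeled in this sketch
    loop_latency = max_wait_count        # step 2338
    initiation_interval = wait_count     # step 2338
    cdfg.append(("loop_end", wait_count))  # step 2340
    return cdfg, loop_latency, initiation_interval
```

For a loop body shaped like Figure 19(a) (a write delayed by 24 time units, a 6-unit clock period, and two wait statements), this sketch places the write in cstep 4 and yields a loop latency of 4 and an initiation interval of 2.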

The output of step 920 of Figure 7 is input to the scheduler, which uses the
CDFG and the loop initiation interval and loop latency to schedule the nodes of the
circuit being generated. In the described embodiment, all nodes except read/write
operation nodes can "float", i.e., can be moved between csteps by the scheduler to allow
the scheduler to create an efficient circuit design. In the CDFG, these nodes are always
assigned a cstep value equal to the initial cstep in which they appear in the HDL as a
"suggestion" to the scheduler. It will be understood by persons of ordinary skill in the
art that the CDFG of Figure 24 has been simplified for the sake of example and that the
CDFG also includes, e.g., data flow arcs connecting the CDFG nodes that represent data
flows in a similar manner to the data flows of Figure 21.

Figure 14 shows an example circuit synthesized from the CDFG of Figure 24.
Figure 25 shows an example of placement of CDFG nodes in csteps without and with
use of the delay clause. In the left column, which represents a CDFG without the delay
clause, CDFG nodes corresponding to write operation node 2114, read operation node
2102, and subtracter node 2110 are assigned to cstep 0. Similarly, CDFG nodes
corresponding to adder node 2112, read operation node 2104, read operation node 2106,
and assignment node 2116 (and a CDFG loop_end node) are assigned to a second
cstep 1. Generation of this CDFG representation causes the synthesizer to generate a
circuit that has different timing characteristics than the characteristics generated by the
circuit synthesizer when the source code includes a delay clause. The right column of
Figure 25 shows the assignment of CDFG nodes to cycles in accordance with the
present invention. In this example, a write operation node corresponding to write
operation node 2114 is moved into cstep 4 during the steps of Figure 23. This
modification of the process to generate the CDFG (possible because of an addition of a
signal delay attribute to the data flow graph 2100) allows the synthesis process to generate a
circuit that has cycle level simulation behavior that is substantially identical to
the cycle level simulation behavior of the source HDL.

Figure 26 shows an example of loop pipelining when the present invention is
used. The figure shows an nth iteration of the loop and an (n+1)st iteration of the loop
over time. As can be seen in the figure, the initiation interval of successive iterations of the
loop is equal to the number of wait statements (or "posedge" or "negedge" statements).
The loop latency is equal to the longest cycle delay from the beginning of the loop to a
latest operation. The throughput of the pipelined loop is not decreased by use of delayed
signal assignments. In general, the scheduler will schedule a circuit having the CDFG of
Figure 24 as a pipelined circuit because the loop latency is longer than the initiation
interval.
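The relationship between initiation interval and loop latency described above can be illustrated with a short Python sketch. The function names and the representation of an iteration as a (first cstep, last cstep) pair are assumptions made for this example only.

```python
def iteration_windows(initiation_interval, loop_latency, iterations):
    """Return (start cstep, cstep of latest operation) for successive iterations.

    Iteration k starts k * initiation_interval cycles after iteration 0, and
    its latest operation occurs loop_latency cycles after its own start.
    """
    return [
        (k * initiation_interval, k * initiation_interval + loop_latency)
        for k in range(iterations)
    ]


def iterations_overlap(initiation_interval, loop_latency):
    """Successive iterations overlap when the latency exceeds the interval."""
    return loop_latency > initiation_interval
```

With an initiation interval of 2 and a loop latency of 4, iteration n occupies csteps 0 through 4 while iteration n+1 begins at cstep 2, so the two iterations overlap and the loop is pipelined.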

In summary, use of delayed signal assignments allows behavioral synthesis to
infer circuits with pipelined loops which have cycle level simulation behavior which
matches that of the source HDL. Pipelined loops may include loop carry dependencies
and/or I/O and/or memory accesses which must be scheduled correctly. The use of a
placeholder node within a template is an efficient representation of such scheduling
constraints.

WHAT IS CLAIMED IS:

1. A method performed by a data processing system having a memory, comprising the
steps of:

parsing a text description of a circuit, said text description stored in the memory,
said text description including a loop with a delayed signal assignment having a delay
value;

translating said text description into a digital circuit representation in said
memory, said digital circuit representation including a pipeline; and

setting a latency of said pipeline equal to said delay value.

2. The method of claim 1, wherein said loop further includes N wait statements, where 
N is greater than zero, said method further comprising the step of setting an initiation 
interval of said pipeline equal to N. 

3. The method of claim 1, wherein said text description is written in Verilog and said
delayed signal assignment uses a Verilog operator.

4. The method of claim 3, wherein said wait statements use Verilog "@posedge"
statements.

5. The method of claim 3, wherein said wait statements use Verilog "@negedge"
statements.

6. The method of claim 1, wherein said text description is written in VHDL, said
delayed signal assignment uses a VHDL "after" clause, and said wait statements use
VHDL "wait" statements.



7. A method, performed by a data processing system having a memory, of building a
digital circuit representation including a pipeline in the memory from a textual
description of a loop, comprising the steps of:

identifying a loop carry dependency in said loop;

identifying a producer operation of said loop carry dependency;

identifying a consumer operation of said loop carry dependency;

determining a number, n, of cycles within which said producer operation must
be scheduled after said consumer operation;

instantiating a placeholder node in said memory;

node-locking said placeholder node so that it must be scheduled n cycles after
said consumer operation; and

constraining said producer operation to be scheduled before said placeholder
node.

8. The method of claim 7, wherein the step of node-locking said placeholder node
further comprises the step of creating a template structure in said memory which
includes said placeholder node and said consumer operation.

9. The method of claim 8,

wherein said producer operation is included in a second template structure in
said memory, and

wherein the step of constraining said producer operation further comprises the
step of constraining said second template structure to be scheduled before said template
structure.

10. The method of claim 7, wherein n is equal to an initiation interval of said pipeline
multiplied by a number of iterations of said loop which execute before data produced by
said producer is consumed by said consumer.

11. A method, performed by a data processing system having a memory, of building
a digital circuit representation in said memory, said digital circuit representation
including a pipeline derived from a textual description of a loop, said method
comprising the steps of:

identifying an access dependency of said loop;

identifying a first access operation of said access dependency;

identifying a second access operation of said access dependency;

determining a number, n, of cycles within which said second access operation
must be scheduled after said first access operation;

instantiating a placeholder node in said memory;

node-locking said placeholder node so that it must be scheduled n cycles after
said first access operation; and

constraining a scheduling order of said second access operation and said
placeholder node.

12. The method of claim 11,

wherein said first access operation is chosen from the group of access operations
including a memory read, a memory write, a signal write and a port write,

said second access operation is chosen from the group of access operations
including a memory read, a memory write, a signal read, a signal write, a port read and
a port write, and

the step of constraining said scheduling order of said second access operation
and said placeholder node further includes the step of forcing said second access
operation to be scheduled before said placeholder node.

13. The method of claim 11,

wherein said first access operation is chosen from the group of access operations
including a memory read, a memory write, a signal read, a signal write, a port read and
a port write,

said second access operation is chosen from the group of access operations
including a memory read, a memory write, a signal write and a port write, and

the step of constraining said scheduling order of said second access operation
and said placeholder node further includes the step of forcing said second access
operation to be scheduled before said placeholder node.

14. The method of claim 11,

wherein said first access operation is chosen from the group of access operations
including a signal read and a port read,

said second access operation is chosen from the group of access operations
including a signal read and a port read, and

the step of constraining said scheduling order of said second access operation
and said placeholder node further includes the step of forcing said second access
operation to be scheduled simultaneous with, or before, said placeholder node.

15. The method of claim 11, wherein the step of constraining said scheduling order of
said second access operation and said placeholder node further includes the step of
forcing said second access operation to be scheduled before said placeholder node.

16. The method of claim 11, wherein the step of node-locking said placeholder node
further includes the step of creating a template which includes said placeholder node
and said first access operation.

17. The method of claim 11, wherein n is equal to an initiation interval of said pipeline
multiplied by a number of iterations of said loop which execute between said first
access operation and said second access operation.

18. A system for building, in a memory, a digital circuit representation which
implements the behavior of a text description in said memory, said system having a
processor coupled to a memory unit wherein said processor is programmed to perform
logic processing, said system comprising:

parsing logic for parsing said text description into a parsed text description, said
text description including a loop with a delayed signal assignment having a delay value;

translating logic for translating said parsed text description into said digital
circuit representation, said digital circuit including a pipeline; and

latency setting logic for setting a latency value of said pipeline to be said delay
value of said delayed signal assignment.

19. A system as described in claim 18, wherein said pipeline implements said loop.

20. A system as described in claim 19, wherein said loop further includes a number, n,
of wait statements, said system further comprising initiation interval setting logic for
setting an initiation interval of said pipeline to be equal to n.

21. A computer program product comprising:

a computer usable medium having computer readable code embodied therein for
building a digital circuit representation from a text description of a digital circuit, the
computer program product comprising:

computer readable program code devices configured to cause a computer to
effect parsing said text description, said text description including a loop with a delayed
signal assignment having a delay value;

computer readable program code devices configured to cause a computer to
effect translating said text description into said digital circuit representation including a
pipeline; and

computer readable program code devices configured to cause a computer to
effect setting a latency of said pipeline equal to said delay value.

22. The computer program product of claim 21 wherein said loop further includes N
wait statements, where N is greater than zero, said computer program product further
comprising computer readable program code devices configured to cause a computer to
effect setting an initiation interval of said pipeline equal to N.

23. A method performed by a data processing system having a memory,
comprising the steps of:

parsing a text description of a circuit, said text description stored in the memory,
said text description including a loop with N wait statements, where N is greater than
zero;

translating said text description into a digital circuit representation in said
memory, said digital circuit representation including a pipeline; and

setting an initiation interval of said pipeline equal to N.

24. The method of claim 23, wherein the wait statements are VHDL wait
statements.

25. The method of claim 23, wherein the wait statements are Verilog HDL
@posedge statements.

26. The method of claim 23, wherein the wait statements are Verilog HDL
@negedge statements.

27. A system for building, in a memory, a digital circuit representation
which implements the behavior of a text description in said memory, said system having
a processor coupled to a memory unit wherein said processor is programmed to perform
logic processing, said system comprising:

parsing logic for parsing said text description into a parsed text description, said
text description including a loop with N wait statements, where N is greater than zero;

translating logic for translating said parsed text description into said digital
circuit representation, said digital circuit including a pipeline; and

initiation interval setting logic for setting an initiation interval of said pipeline
equal to N.

28. The system of claim 27, wherein the wait statements are VHDL wait
statements.

29. The system of claim 27, wherein the wait statements are Verilog HDL
@posedge statements.

30. The system of claim 27, wherein the wait statements are Verilog HDL
@negedge statements.

31. A computer program product comprising a computer usable medium
having computer readable code embodied therein for building a digital circuit
representation from a text description of a digital circuit, the computer program product
comprising:

computer readable program code devices configured to cause a computer to
effect parsing said text description, said text description including a loop with N wait
statements, where N is greater than zero;

computer readable program code devices configured to cause a computer to
effect translating said text description into said digital circuit representation including a
pipeline; and

computer readable program code devices configured to cause a computer to
effect setting an initiation interval of said pipeline equal to N.

32. The method of claim 31, wherein the wait statements are VHDL wait
statements.



33. The method of claim 31, wherein the wait statements are Verilog HDL



Page 36 of 37 Express Mail No. EK05 1 3 1 3780US 

T.A. Ly et al. 



@posedge statements.

34. The method of claim 31, wherein the wait statements are Verilog HDL
@negedge statements.
Abstract

METHODS FOR AUTOMATICALLY PIPELINING LOOPS
A method and an apparatus for creating a representation of a circuit with a
pipelined loop from an HDL source code description. It infers a circuit including a
pipelined loop which has cycle level simulation behavior matching that of the source
HDL. Loop carry dependencies and memory and signal I/O accesses within the loop
are scheduled correctly.
module loopcxS ( c, x, y, z, clock );
  input [1:0] x, y, z;
  input clock;
  output [2:0] c;
  reg [2:0] c;
  reg [2:0] p;

  always begin                    // 3030
    forever begin : theloop       // 3010
      c <= x - p;
      // 3020
      p = y + z;
      @(posedge clock);
    end
  end
endmodule

Figure 11
module write4 ( w, x, clock );
  input [15:0] x;
  input clock;
  output [31:0] w;
  reg [31:0] w;
  reg [15:0] x1;
  reg [15:0] x2;

  always begin                    // 1530
    forever begin : writeloop     // 1530
      x1 <= x;
      x2 <= x;
      // 1530
      w <= x1 * x2;
    end
  end
endmodule

Figure 15
module after1 ( c, x, y, z, clock );
  input [1:0] x, y, z;
  input clock;
  output [2:0] c;
  reg [2:0] c;
  reg [2:0] p;

  always begin
    @(posedge clock);
    forever begin
      c <= #24 x - p;
      @(posedge clock);
      p = y + z;
      @(posedge clock);
    end
  end
endmodule

Figure 19 (a)

entity after1 is
  port(
    c : out integer range 0 to 7;
    x, y, z : in integer range 0 to 3;
    clock : in bit
  );
end after1;

architecture behavioral of after1 is begin
  process
    variable p : integer range 0 to 7;
  begin
    wait until clock'event and clock = '1';
    loop
      c <= transport x - p after 24 ns;
      wait until clock'event and clock = '1';
      p := y + z;
      wait until clock'event and clock = '1';
    end loop;
  end process;
end behavioral;

Figure 19 (b)
Scheduling using Behavioral Templates

Tai Ly, David Knapp, Ron Miller, Don MacMillen
Synopsys Inc.
700B E. Middlefield Road
Mountain View, CA USA 94043



Abstract: This paper presents the idea of "behavioral templates"
in scheduling. A behavioral template locks several operations
into a relative schedule with respect to one another. This
simple construct proves powerful in addressing: (1) timing
constraints, (2) sequential operation modeling, (3) pre-chaining of
certain operations, and (4) hierarchical scheduling. We present
design examples from industry to demonstrate the importance
of these issues in scheduling.

1.0 Introduction 

The task of scheduling [4] is to sequence nodes in a control and data
flow graph (CDFG) by assigning each node to a control step
(cstep). We present the idea of behavioral templates, and describe
how we use behavioral templates to address several issues that arise
when applying scheduling to commercial designs. For the purpose
of this paper, we assume timing constrained scheduling [5].

A behavioral template specifies a relative scheduling among its
member CDFG nodes. It is a template in the sense that its member
nodes can be treated as a single scheduling unit by assigning the
starting cstep for the template. It is behavioral in the sense that it
specifies a scheduling pattern as opposed to, for example, a structural
pattern [8]. We extend scheduling algorithms to handle behavioral
templates by recasting the task of scheduling as that of
assigning templates to csteps.

Although a simple idea, behavioral templates provide a powerful 
way to address four issues in scheduling: 

1. Timing constraints. We use behavioral templates to impose
fixed and maximum timing constraints. This is more efficient
than using precedence edges alone because an entire sequence of
nodes is considered at once when scheduling one template.

2. Multi-cycle operations. To enable scheduling of complex multi-cycle
operations, we use multiple CDFG nodes locked in a
behavioral template to model the cycle-by-cycle I/O and resource
requirements of such operations.

3. Logic and bit-manipulation operations. We use behavioral
templates to force certain chaining of logic and bit-manipulation
operations to save register costs. This reduces the scheduling
design space, and therefore run times.

4. CDFG hierarchy. We implement hierarchical scheduling by
inlining each scheduled subgraph, using a behavioral template to
lock the inlined nodes according to the subgraph's schedule.

This paper is organized as follows. Section 2 compares this work to
previous research. Section 3 defines behavioral templates. Section 4
describes extending scheduling for behavioral templates. Section 5
discusses applications. Section 6 presents results. Section 7 concludes
this paper.

2.0 Related Work 

The term "template" was used in [8] to describe structural patterns
to exploit regularity. In [9] and [10], such templates are used to
guide the clustering of CDFG nodes into super nodes which map to
"regular" subcircuits. Both of these works focus on extracting regular
patterns by pattern matching, whereas our work focuses on how
to schedule a set of behavioral patterns. Our behavioral templates
do not represent repeating patterns, but specify local scheduling
constraints among CDFG nodes.

Most scheduling systems model multi-cycle operations using single
CDFG nodes whose delays are greater than 1. In [7], multi-cycle
operations are treated as multiple single-cycle operations. This
turns out to be similar to our template-based model for sequential
operations, except that we make deliberate use of cycle-by-cycle
input/output and resource requirements to model complex operations.

Hierarchical scheduling based on super nodes is used in [9], [6],
and [7]. We know of no other system which hierarchically schedules
a design while taking advantage of possible resource sharing
between nodes and edges in different subgraphs.

3.0 Behavioral Templates 

We define a behavioral template, T, as a CDFG object which specifies
a set of tuples, (n_i, o_i), where n_i is a CDFG node and o_i is an
integer cycle offset. The semantics is that T imposes the constraint:

schedule(n_i) = schedule(T) + o_i, for all (n_i, o_i) in T

where schedule(n_i) and schedule(T) denote the schedules for n_i
and T, respectively.

That is, if T is scheduled to cstep j, then every member node, n_i, of
T must be scheduled to the cstep j + o_i. This locks all nodes in T
into a pattern of relative schedules, and we may schedule the entire
group of nodes by scheduling the template T itself. Fig. 1(a) shows
a template, T1 = { (a,0) (b,1) (c,2) (d,4) (e,5) }, containing 5 nodes.
All CDFG edges have been omitted for clarity. (In the figures, we
show a behavioral template as a box containing one or more nodes in
slots. The top slot in the box is offset 0, the second slot from top is
offset 1, and so on. For example, the node "e" in Fig. 1(a) has offset
5 in T1 because it is in the 6th slot from the top of the box.)

Whenever a node is a member of two or more different templates,
we can always merge these templates into one. Consider the templates
T1 and T2 in Fig. 1(a) and 1(b). If node g is to be added to
template T1 at offset 1, then we merge T1 and T2 into T3 of Fig.
1(c).
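The template constraint and the merge operation can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' implementation: templates are modeled as dictionaries from node name to offset, the function names are hypothetical, and the member list used for T2 in the usage note is an assumed example because the figure caption is not fully legible in the source.

```python
def node_schedules(template, start_cstep):
    """Apply schedule(n_i) = schedule(T) + o_i to every member of a template."""
    return {node: start_cstep + offset for node, offset in template.items()}


def merge_templates(t1, t2, shared_node, offset_in_t1):
    """Merge t2 into t1 so that shared_node (a member of t2) sits at
    offset_in_t1 in t1's frame, then renormalize so the smallest offset is 0."""
    shift = offset_in_t1 - t2[shared_node]
    merged = dict(t1)
    for node, offset in t2.items():
        new_offset = offset + shift
        if node in merged and merged[node] != new_offset:
            raise ValueError(f"conflicting offsets for node {node!r}")
        merged[node] = new_offset
    base = min(merged.values())
    return {node: offset - base for node, offset in merged.items()}
```

For example, with T1 as given in the text and an assumed T2 = { (f,0) (g,2) (h,3) }, adding g to T1 at offset 1 shifts T2 by one slot upward, so f lands one slot before a and the merged template is renumbered to start at offset 0 with f.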







FIGURE 1. Template examples: (a) T1 = { (a,0) (b,1) (c,2) (d,4) (e,5) }; (b) T2; (c) T3. Legend: circle = CDFG node; box = CDFG behavioral template.

4.0 Scheduling with Behavioral Templates 

Instead of scheduling individual CDFG nodes, we restate the scheduling
problem in terms of behavioral templates. Initially, we create
one template for every CDFG node, and then merge templates
whenever nodes are added to other templates. This ensures that
every CDFG node is a member of one and only one template. The
timing constrained scheduling task is then to schedule all templates
to minimize resource costs subject to timing constraints between
templates. This section describes how we extend existing scheduling
algorithms for behavioral templates.

4.1 Timing Constraints between Templates

From the CDFG, we construct a weighted, directed graph G=(V, E)
where V is the set of all behavioral templates in the CDFG, and E is
the set of directed edges between templates. The weight d(T_x, T_y) of
an edge e(T_x, T_y) in E specifies the minimum delay between the
schedules of T_x and T_y, i.e.,

schedule(T_x) + d(T_x, T_y) <= schedule(T_y)        Eq. 1

The edges in E are constructed from the data/control dependencies
between member nodes in the templates. For every pair of templates,
T_x and T_y, d(T_x, T_y) is the maximum value of

w(n_i, n_j) + o_i - o_j        Eq. 2

over all (n_i, o_i) in T_x and all (n_j, o_j) in T_y, where w(n_i, n_j) is the
minimum cycle delay from node n_i to node n_j.
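Eq. 2 can be evaluated directly from the member lists of two templates. The following Python sketch assumes that templates are dictionaries from node name to offset and that node-to-node minimum delays are given as a dictionary keyed by (n_i, n_j); these representations and the function name are assumptions for illustration.

```python
def template_edge_weight(tx, ty, w):
    """Return d(T_x, T_y) per Eq. 2, or None if no node of tx precedes a node of ty.

    tx, ty: dict mapping node name -> cycle offset within the template.
    w: dict mapping (n_i, n_j) -> minimum cycle delay from n_i to n_j.
    """
    candidates = [
        w[(ni, nj)] + oi - oj
        for ni, oi in tx.items()
        for nj, oj in ty.items()
        if (ni, nj) in w
    ]
    return max(candidates) if candidates else None
```

Note that a dependency from a late slot of T_x to an early slot of T_y yields a large weight, while a dependency from an early slot to a late slot can yield a negative weight, which is why the resulting graph can contain negative edges.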

Note that Eq. 2 can be negative. This means d(T_x, T_y) can be negative,
and the graph G is not acyclic. If G contains any cycle of positive
length, then the timing constraints are unsatisfiable. To check
for positive cycles, we solve the all-pairs-longest-path problem
for G using a simple O(N^3) algorithm, where N is the number of
templates in G. The longest path lengths are stored in a matrix, LP,
for subsequent incremental update of the as soon as possible
(ASAP) and as late as possible (ALAP) schedules.
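One way to compute the longest-path matrix and detect positive cycles is a Floyd-Warshall-style recurrence, sketched below in Python. The paper only says "a simple algorithm", so the choice of this particular recurrence, the function name, and the edge representation are assumptions.

```python
NEG_INF = float("-inf")


def all_pairs_longest_paths(num_templates, edges):
    """Return the longest-path matrix LP, or None if a positive cycle exists.

    edges: dict mapping (x, y) -> d(T_x, T_y), with templates numbered 0..N-1.
    LP[x][y] is NEG_INF when T_y is not reachable from T_x.
    """
    lp = [[NEG_INF] * num_templates for _ in range(num_templates)]
    for i in range(num_templates):
        lp[i][i] = 0
    for (x, y), d in edges.items():
        lp[x][y] = max(lp[x][y], d)
    for k in range(num_templates):
        for i in range(num_templates):
            if lp[i][k] == NEG_INF:
                continue
            for j in range(num_templates):
                if lp[k][j] == NEG_INF:
                    continue
                candidate = lp[i][k] + lp[k][j]
                if candidate > lp[i][j]:
                    lp[i][j] = candidate
    # A positive entry on the diagonal means some cycle has positive length,
    # so the timing constraints cannot all be satisfied.
    if any(lp[i][i] > 0 for i in range(num_templates)):
        return None
    return lp
```

A cycle whose total weight is zero or negative leaves the diagonal at 0 and is accepted; only a cycle of positive length makes the constraints unsatisfiable.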

4.2 ASAP and ALAP Schedules 

At the start of scheduling, we calculate the ASAP and ALAP schedules
for all templates in G, to establish the scheduling time frame
for each template. Since G may contain negative weighted edges,
we use a relaxation algorithm similar to that in [3] to compute the
initial ASAP/ALAP schedules:

(1) Propagate along positive edges in E only;

• for ASAP, propagate forward from the source of CDFG;

• for ALAP, propagate backward from the sink of CDFG;

(2) Relax schedules to satisfy constraints implied by negative
edges in E;

(3) Repeat step 1 until no more changes in relaxation step.

When there are no positive cycles in G, the above algorithm is guaranteed
to converge in e+1 iterations where e is the number of negative
edges in E. The overall computational complexity is O(N^2 e)
where N is the number of templates in G.
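The ASAP half of the relaxation procedure above can be sketched in Python as follows. This is a simplified reading of the three steps, with several assumptions: every template starts at cstep 0 (standing in for propagation from the CDFG source), edges of weight zero are treated together with the positive edges, and the function name and edge representation are hypothetical.

```python
def asap_schedules(num_templates, edges):
    """Return ASAP csteps per template, or None if relaxation does not converge.

    edges: dict mapping (x, y) -> d, meaning schedule[x] + d <= schedule[y].
    """
    asap = [0] * num_templates
    forward = [(x, y, d) for (x, y), d in edges.items() if d >= 0]
    negative = [(x, y, d) for (x, y), d in edges.items() if d < 0]

    def push(edge_list):
        """Raise schedules that violate an edge; report whether anything moved."""
        changed = False
        for x, y, d in edge_list:
            if asap[x] + d > asap[y]:
                asap[y] = asap[x] + d
                changed = True
        return changed

    # At most e + 1 rounds, where e is the number of negative edges.
    for _ in range(len(negative) + 1):
        # Step 1: propagate along non-negative edges until stable
        # (bounded by N passes so a positive cycle cannot loop forever).
        for _ in range(num_templates):
            if not push(forward):
                break
        # Step 2: relax to satisfy the negative (maximum timing) edges.
        if not push(negative):
            return asap  # step 3: no change in the relaxation step
    return None
```

The ALAP schedule can be obtained symmetrically by propagating backward from the latest allowed cstep and lowering schedules instead of raising them.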

The ASAP and ALAP schedules define the initial time frames. Subsequently,
as each template is scheduled, we update the time frames
of other templates using the longest path lengths matrix, LP:

schedule(T_x) + LP(T_x, T_y) <= schedule(T_y) for all T_x, T_y in V

There is no need for relaxation in this incremental update because
LP already takes into account all negative edges in E.

4.3 Cost Functions

We use a number of iterative/constructive scheduling algorithms
each of which successively picks an unscheduled template and
schedules it to a cstep in its time frame. The algorithms differ in
how they pick the next template to schedule, and in how they pick
which cstep to schedule the template to. We define the template
priority/cost functions in terms of priority/cost functions on the CDFG
nodes.

For example, in our implementation of list scheduling, the template
priority function is defined as the maximum of its member nodes'
priority values. This gives priority to the template containing the
highest priority nodes. In our implementation of greedy scheduling,
the incremental cost function for scheduling a template T={(n_i, o_i)}
to a cstep j is defined as the sum total of the incremental costs for
scheduling nodes n_i to csteps j + o_i.
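The two template-level functions described above reduce to a maximum and a sum over member nodes. The Python sketch below assumes templates are dictionaries from node name to offset, that node priorities are given as a dictionary, and that the node-level incremental cost is supplied as a callable; all names are hypothetical.

```python
def template_priority(template, node_priority):
    """List scheduling: a template's priority is the maximum member priority."""
    return max(node_priority[node] for node in template)


def template_incremental_cost(template, cstep, node_cost):
    """Greedy scheduling: sum the incremental cost of placing each member
    node n_i at cstep + o_i.

    node_cost: callable (node, cstep) -> incremental cost of that placement.
    """
    return sum(node_cost(node, cstep + offset) for node, offset in template.items())
```

Because both functions are defined purely in terms of node-level functions, an existing node-based scheduler can be reused with little change.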

Scheduling/de-scheduling moves on templates are implemented as
moves on their member nodes. All data structures are updated as
CDFG nodes are scheduled/de-scheduled. In particular, resource
costs for functional units, registers and interconnects are still computed
according to the lifetimes and mutual exclusivity of CDFG
nodes and edges. This approach is easy to implement and leverages
previous work on scheduling CDFG nodes.

4.4 Pre-assigned Operations

Allowing negative edges in G requires that we extend scheduling
algorithms to handle maximum timing constraints. This is complicated
by "pre-assigned" operations, i.e., operations that are
assigned to specific resources before scheduling. Examples of pre-assigned
operations are memory read/write operations for the same
RAM. We use a list scheduling algorithm to find an initial legal
schedule based on source code ordering. However, list scheduling
can fail to find a legal schedule when there are maximum timing
constraints. So we augment list scheduling with a recovery step.
When list scheduling fails, the recovery step relaxes the template
schedules that caused scheduling failures, and iterates:

1. List scheduling step: 

Successively consider operations in the ready list in increasing
source code ordering. For each ready operation, n_i, check its template,
T_x = {...(n_i, o_i)...}, for scheduling in the cstep s - o_i, where s is
the current cstep. Postpone scheduling of T_x if any of the following
is true:

• T_x has a "relaxed cstep" (see step 2) which is greater than s - o_i

• T_x has no "relaxed cstep", but its ASAP cstep is greater than s - o_i

• there is a resource contention if T_x is scheduled to s - o_i

When all nodes have been scheduled, exit with success.

If T_x is postponed due to resource contention, and if s - o_i is greater
than or equal to the ALAP cstep for T_x, then list scheduling has
failed. When this happens, go to step 2 and try to recover.

2. Recovery step: 

When list scheduling fails to find a legal schedule for T_x, we try to
recover by increasing its ALAP cstep and rerunning list scheduling in
step 1. In order to increase the ALAP cstep for T_x, we find all
scheduled templates, T_y, for which

alap(T_x) = schedule(T_y) - LP(T_x, T_y)        Eq. 3

where alap(T_x) is the ALAP cstep for T_x. For every such template,
T_y, we set its "relaxed cstep" to schedule(T_y) + 1. This forces the
next run of list scheduling to schedule T_y one cstep later.

This step exits with failure if any of the following is true:

• alap(T_x) is at maximum global cstep

• there is no template T_y which satisfies Eq. 3

• algorithm has iterated for N times (N is the # of templates)
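
The recovery step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name, the dictionary-based data layout, and the convention that `lp` holds LP(T_x, T_y) keyed by template pair are assumptions made here, and the limit of N iterations is left to the caller.

```python
from typing import Dict, Tuple


def recovery_step(failed: str,
                  schedule: Dict[str, int],
                  lp: Dict[Tuple[str, str], int],
                  alap: Dict[str, int],
                  relaxed: Dict[str, int],
                  max_cstep: int) -> bool:
    """Try to raise the ALAP cstep of `failed`; return False when recovery fails."""
    if alap[failed] >= max_cstep:
        return False  # ALAP is already at the maximum global cstep
    # Scheduled templates T_y that pin the ALAP of T_x, i.e. satisfy Eq. 3:
    #   alap(T_x) = schedule(T_y) - LP(T_x, T_y)
    binding = [t for t, cstep in schedule.items()
               if (failed, t) in lp and alap[failed] == cstep - lp[(failed, t)]]
    if not binding:
        return False  # no template satisfies Eq. 3
    for t in binding:
        relaxed[t] = schedule[t] + 1  # force T_y one cstep later in the next run
    return True


# Hypothetical instance: T1 sits in cstep 0 and LP(T2, T1) = 0, so alap(T2) = 0.
relaxed: Dict[str, int] = {}
ok = recovery_step("T2", {"T1": 0}, {("T2", "T1"): 0}, {"T2": 0}, relaxed, max_cstep=3)
print(ok, relaxed)  # True {'T1': 1}
```

The caller would then clear the schedule and rerun list scheduling, which now postpones T1 until its relaxed cstep.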



[Figure 2 labels: "same resource"; LP(T2, T1) = 0]

FIGURE 2. Example of list scheduling failure and recovery: (a) first
iteration fails at T2, (b) second iteration, after relaxing T1

Fig. 2 shows an example of this algorithm at work. In this example,
CDFG nodes "a" and "c" are pre-assigned to the same resource.
The source code ordering has "a" before "c" before [illegible]. Initially,
list scheduling in step 1 will schedule T1 to cstep 0, and then fail to
schedule T2 because of resource contention at cstep 0, and because
its ALAP cstep is 0 since T1 is scheduled to 0. In step 2, T1 will be
assigned a relaxed cstep of 1. In the next iteration, list scheduling
first schedules T2 to cstep 0, then schedules T1 to cstep 1 to avoid
resource contention, and finally schedules T3 to cstep 3.

We have two recourses when the above algorithm fails. First, we
can continue to try other scheduling algorithms, which may still find
legal schedules. Second, we can insert precedence constraints to
sequentialize pre-assigned operations by their source code ordering.
Any unsatisfiable maximum timing constraints would then be
detected as positive cycles in the graph G.

5.0 Applications for Behavioral Templates

This section highlights how we use behavioral templates to advantage.
Fig. 3 shows the overall flow of our scheduling process.

[Figure 3 flow steps: extract CDFG; create templates; insert constraints]

FIGURE 3. Overall flow for hierarchical scheduling

5.1 Inserting Timing Constraints

After extracting the CDFG and creating the initial templates, user-specified
timing constraints are added to the CDFG. Minimum timing
constraints are represented by precedence edges between nodes,
but fixed timing constraints and maximum timing constraints are
represented with the help of behavioral templates. A fixed timing
constraint requires two or more operations to be scheduled a
fixed number of cycles apart. This is represented by adding one
operation to the template of the other operation with the proper offset.
For example, if n_j must start k csteps after n_i starts, and if n_i is
in template T = {...(n_i, o_i)...}, then we add n_j to T at offset o_i + k
(Fig. 4(a)); if n_j must start k csteps after n_i ends, and n_i has a delay
of d cycles, then we add n_j to T at offset o_i + d - 1 + k (Fig. 4(b)).
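
The two offset rules can be written out as a short Python sketch. The dictionary representation of a template and the function names are assumptions made for illustration only.

```python
from typing import Dict

Template = Dict[str, int]  # node name -> offset (in csteps) inside the template


def lock_start_to_start(template: Template, ni: str, nj: str, k: int) -> None:
    """n_j must start k csteps after n_i starts: offset o_i + k."""
    template[nj] = template[ni] + k


def lock_end_to_start(template: Template, ni: str, nj: str, k: int, d: int) -> None:
    """n_j must start k csteps after n_i ends, where n_i has a static delay
    of d cycles: offset o_i + d - 1 + k."""
    template[nj] = template[ni] + d - 1 + k


t: Template = {"ni": 2}
lock_start_to_start(t, "ni", "nj", k=3)
print(t["nj"])  # 5
lock_end_to_start(t, "ni", "nk", k=1, d=2)
print(t["nk"])  # 2 + 2 - 1 + 1 = 4
```

The `d - 1` term reflects that an operation with a delay of d cycles that starts in cstep o_i ends in cstep o_i + d - 1.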

However, if n_j must start k csteps after n_i ends, and n_i does not have
a static delay (e.g., n_i is a subgraph), then we decompose the fixed
constraint into a k-cycle minimum timing constraint from the end of
n_i to the start of n_j, plus a k-cycle maximum timing constraint from
the start of n_j to the end of n_i. This is shown in Fig. 4(c).

Note that in Fig. 4(c), we create a dummy place holder node, ph,
and lock it in a template with n_j (k cycles apart). This template combines
with the precedence edge of weight 0 from ph to the end of n_i






to represent the maximum timing constraint from the start of n_j to
the end of n_i. The precedence edge of weight k from the end of n_i to the
start of n_j represents the minimum timing constraint from the end
of n_i to the start of n_j.

In general, a maximum timing constraint of k cycles from a set of
nodes, A, to the set of nodes, B, is represented by creating two place
holders, ph1 and ph2, fixed k cycles apart in a template, and inserting
a 0-weight precedence edge from ph1 to all nodes in A, and
inserting a 0-weight precedence edge from all nodes in B to ph2.
Fig. 9(a) contains an example of this where the dummy place holder
nodes, t2 and t3, are used to lock a write operation to 0 cycle after
the end of a loop.
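
The general construction can be sketched in Python. The edge-list and template-list representations, the function name, and the `tag` used to make the place-holder names unique are assumptions made for this illustration.

```python
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str, int]  # (source, destination, weight in cycles)


def add_max_constraint(a_nodes: Iterable[str], b_nodes: Iterable[str], k: int,
                       templates: List[Dict[str, int]], edges: List[Edge],
                       tag: str) -> None:
    """Encode 'at most k cycles from set A to set B' with two place holders."""
    ph1, ph2 = f"ph1_{tag}", f"ph2_{tag}"
    templates.append({ph1: 0, ph2: k})           # ph1 and ph2 fixed k cycles apart
    edges.extend((ph1, a, 0) for a in a_nodes)   # ph1 precedes every node in A
    edges.extend((b, ph2, 0) for b in b_nodes)   # every node in B precedes ph2


templates: List[Dict[str, int]] = []
edges: List[Edge] = []
add_max_constraint(["a1", "a2"], ["b1"], 2, templates, edges, tag="c0")
print(templates)  # [{'ph1_c0': 0, 'ph2_c0': 2}]
print(edges)      # [('ph1_c0', 'a1', 0), ('ph1_c0', 'a2', 0), ('b1', 'ph2_c0', 0)]
```

Since no node in A can be earlier than ph1 and no node in B can be later than ph2 = ph1 + k, every node in B is at most k cycles after every node in A, using only precedence edges and one template.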



[Figure 4 label: offset o_i + d - 1 + k]

FIGURE 4. Timing constraints: (a) n_j starts k cycles after n_i starts, (b) n_j
starts k cycles after n_i ends and n_i has static delay d, (c) n_j starts k cycles
after n_i ends and n_i's delay is not static

5.2 Modeling Multi-cycle Operations

Behavioral templates also help model complex multi-cycle operations.
When a single CDFG node is used to model a multi-cycle
operation, it imposes some limitations due to CDFG semantics:

• Execution cannot start until ALL inputs are available.

• ALL inputs must be held stable throughout operation execution.

• ALL outputs are produced in the last cycle of execution.

This makes it difficult to model, for example, a 3-cycle RAM write
operation where the address must be stable for the first two cycles,
the data must be stable in the second cycle, and the write sequence
finishes in the third cycle. To model such complex operations, we
differentiate between combinational and sequential multi-cycle
operations. A combinational operation has cycle-synchronous
inputs and outputs, so it is modeled by a single CDFG node. A
sequential operation can have different cycle-by-cycle input/output
connections and even resource requirements, and is modeled by
several CDFG nodes that are locked into consecutive csteps by a
behavioral template. Fig. 5 shows the single-node and multiple-nodes-in-a-template
models for the above 3-cycle RAM write operation.
Note that in our model (Fig. 5(b)), the address and data inputs
are decoupled in terms of when and for how long each input must
be stable. This also decouples multiple outputs (if any).

[Figure 5 labels: addr, data; delay = 3]

FIGURE 5. Models for a 3-cycle RAM write operation: (a) single node
with delay = 3; (b) 3 nodes locked in a template

This multiple-nodes-in-a-template model is even more powerful
when resource requirements are added. For example, a pipelined
operation uses a different pipe-stage in different cycles, allowing
overlapping pipelined operations to share the same hardware module
as long as they do not have resource contention on any pipe-stage.
If we view pipe-stages as internal resources and assign each
pipe-stage a named "token", then we may label each node in the
template model with the resource tokens it requires. As a node is
scheduled to a cstep, we reserve its resource tokens for that cstep.
The number of conflicting tokens (i.e., number of non-mutually-exclusive
nodes that require the same token) in any cstep gives the
number of pipelined modules needed in that cstep. Overlapping
pipelined operations can be scheduled on the same module because
successive nodes in the template model require different tokens.

This removes assumptions about pipelined operations from the
scheduling algorithms. We may now model operations on complicated
pipelines, and template-based scheduling will properly schedule
these operations on the pipelined modules. Fig. 6 shows
examples of operations on pipelines with internal feedback, sequential
inputs, and multiple outputs. We use "a[s1]" to denote an operation
named "a" which requires the token "s1".
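
The token bookkeeping described above can be sketched in Python. The list-of-pairs template model, the function name, and the choice to report the maximum over all csteps are assumptions made here; mutual exclusivity between nodes is deliberately ignored to keep the sketch short.

```python
from collections import Counter
from typing import Dict, List, Tuple

# A template model: list of (offset, token) pairs, one per locked node.
PIPE3 = [(0, "s1"), (1, "s2"), (2, "s3")]  # basic 3-stage pipelined operation


def modules_needed(starts: List[int], model: List[Tuple[int, str]]) -> int:
    """Largest number of operations that need the same token in the same cstep
    when operations sharing `model` start at the given csteps."""
    usage: Dict[Tuple[int, str], int] = Counter()
    for start in starts:
        for offset, token in model:
            usage[(start + offset, token)] += 1   # reserve the token in that cstep
    return max(usage.values(), default=0)


print(modules_needed([0, 1, 2], PIPE3))  # 1: successive stages use different tokens
print(modules_needed([0, 0], PIPE3))     # 2: both operations need "s1" in cstep 0
```

A model with internal feedback would simply list the same token at two different offsets, and the same counting would then forbid the overlaps that collide on that stage.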





FIGURE 6. Template models for: (a) basic 3-stage pipelined operation,
(b) 3-cycle pipelined operation with 2 stages and internal feedback, (c)
4-cycle pipelined operation with 2 stages and sequential inputs, (d)
pipelined operation using a different internal path and output port

Actually, resource tokens need not correspond to physical hardware
resources, but may be considered a more general mechanism for
specifying how different types of operations can overlap in time on
the same module. Consider a 2-cycle RAM which has one read-port
and one write-port, whose read/write cycles must be synchronized.
Fig. 7 shows how we use resource tokens to specify this constraint
to scheduling. If two such operations are pre-assigned to the same
RAM, then resource contention on any of "s1", "s2", "s3" and "s4"
implies an illegal schedule. The token "s3" prevents read operations
for the same RAM from overlapping; the token "s4" prevents write
operations for the same RAM from overlapping; and the tokens "s1" and "s2"
prevent read and write operations for the same RAM from being
scheduled exactly one cycle apart.



[Figure 7 labels: raddr, waddr]

FIGURE 7. Template models for RAM: (a) 2-cycle read, and (b) 2-cycle
write.

To handle multi-port RAM's, we allow a module to carry more than
1 copy of a given resource token. For example, to model a 4-port
RAM where each port can be used for either read or write, we would
define the RAM module to have 4 "r/w" tokens, and model read and
write operations on this RAM to require 1 "r/w" token each. This
would allow scheduling to perform up to 4 simultaneous read or
write operations on the same module.
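
A contention check with multiple copies of a token might look like the following Python sketch. The function name and the dictionary used for the module's token capacity are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def has_contention(requests: List[str], capacity: Dict[str, int]) -> bool:
    """True if the operations scheduled in one cstep on one module ask for more
    copies of some token than the module carries."""
    demand = Counter(requests)
    return any(count > capacity.get(token, 0) for token, count in demand.items())


four_port_ram = {"r/w": 4}                         # module carries 4 "r/w" tokens
print(has_contention(["r/w"] * 4, four_port_ram))  # False: 4 simultaneous accesses fit
print(has_contention(["r/w"] * 5, four_port_ram))  # True: a fifth access must wait
```

With a capacity of 1 for every token this reduces to the single-copy case used for the 2-cycle RAM above.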

5.3 Pre-Chaining

Just before scheduling, we selectively force operation chaining by
locking operations in the same cycle using behavioral templates.
This "pre-chaining" step reduces scheduling complexity at the
expense of scheduling freedom. User-specified chaining directives
are applied in this step. We also implement automatic pre-chaining
for logic operations and bit-manipulation operations to save registers.

Logic operations include bit-wise AND, OR, NOT, EXOR operations,
and bit-manipulation operations include bit-extract, bit-concatenate,
and constant bit/word generator operations. These operations
are good candidates for pre-chaining because they have small propagation
delays and they are not resource shared. Thus pre-chaining
can be done on the basis of register costs alone. We implement a
greedy algorithm for pre-chaining:

1. In a forward traversal of the data flow graph, pre-chain a logic/bit-manipulation
operation with its predecessors if there are fewer output
bits than input bits.

2. In a reverse traversal of the data flow graph, pre-chain a logic/bit-manipulation
operation with its successors if there are fewer input
bits than output bits.

3. Iterate until there are no more changes.
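
The three steps above can be sketched in Python. This is a simplified illustration under stated assumptions: the `Op` record, the function name, and the fixed per-operation bit counts are hypothetical, and a fuller version would recompute the bit counts at the boundary of each chained group, which is what makes the iteration in step 3 matter.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Op:
    chainable: bool            # True for logic / bit-manipulation operations
    in_bits: int
    out_bits: int
    preds: List[str] = field(default_factory=list)
    succs: List[str] = field(default_factory=list)


def pre_chain(ops: Dict[str, Op], order: List[str]) -> Set[Tuple[str, str]]:
    """Return the (producer, consumer) pairs to lock into the same cycle.
    `order` is a topological order of the data flow graph."""
    chained: Set[Tuple[str, str]] = set()
    changed = True
    while changed:                              # step 3: iterate to a fixed point
        before = len(chained)
        for name in order:                      # step 1: forward traversal
            op = ops[name]
            if op.chainable and op.out_bits < op.in_bits:
                chained.update((p, name) for p in op.preds)
        for name in reversed(order):            # step 2: reverse traversal
            op = ops[name]
            if op.chainable and op.in_bits < op.out_bits:
                chained.update((name, s) for s in op.succs)
        changed = len(chained) != before
    return chained


ops = {
    "add":     Op(False, 32, 16, succs=["extract"]),
    "extract": Op(True, 16, 4, preds=["add"], succs=["mul"]),
    "const":   Op(True, 0, 8, succs=["mul"]),
    "mul":     Op(False, 12, 12, preds=["extract", "const"]),
}
pairs = pre_chain(ops, ["add", "const", "extract", "mul"])
print(sorted(pairs))  # [('add', 'extract'), ('const', 'mul')]
```

In this example the bit-extract is chained backward because registering its 4 output bits is cheaper than registering its 16 input bits, and the constant generator is chained forward because it has no inputs to register at all.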

Fig. 8 shows examples of good pre-chaining configurations. 

[Figure 8 label: logic operations]

FIGURE 8. [caption partly illegible] (a) ... with successor; (b) ...;
(c) bit-extract with predecessor; (d) multi-input logic with predecessor
or multi-output logic with successor

5.4 Hierarchical Scheduling

As shown in Fig. 3, our extracted CDFG is hierarchical, in which
each level of the hierarchy corresponds to a loop or a subroutine.
Hierarchical scheduling proceeds in a bottom-up traversal of the
hierarchy. At each level, instead of representing subgraphs as super
nodes, we inline each subgraph and use a behavioral template to
interlock the inlined nodes according to the subgraph's schedule.

Inlining subgraphs allows certain boundary optimizations. First,
unused subgraph outputs (and the operations that produce these outputs)
can be deleted. This deletion can recurse to unused subgraph
inputs and then to operations that feed these inputs. Second, inlining
subgraphs allows scheduling of neighboring nodes to take
advantage of when individual subgraph inputs/outputs are actually
required/produced, whereas representing subgraphs as super nodes
would force scheduling to assume that all subgraph inputs/outputs
are required/produced in the same cycles.

Another advantage of inlining subgraphs is that scheduling maintains
accurate cycle-by-cycle resource costs. This allows us, for example,
to calculate resource costs of scheduling operations in the first
and last cycles of a loop subgraph. (When an outside operation is
scheduled in the first/last cycle of a loop it is performed when entering/exiting
the loop.)

In fact, hierarchical scheduling is used to implement sequential
multi-cycle operations. In the initial CDFG, each sequential operation
is a subroutine call to some library function, which is a pre-scheduled
CDFG whose nodes are labelled with the required
resource tokens. During inlining, each sequential operation is
replaced by an inlined copy of its function's CDFG, and a template
is created to lock these inlined nodes. This creates the multiple-nodes-in-a-template
model for sequential operations.

The disadvantage of inlining subgraphs is that more nodes are
scheduled instead of a small number of super nodes. This is balanced
somewhat by the fact that inlined nodes are grouped by templates
into a few scheduling units, so at least the scheduling
solution space is not much bigger.
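
Turning a scheduled subgraph into a template of inlined nodes can be sketched in Python. The function name, the `prefix` used to keep inlined copies distinct, and the sample schedule are assumptions made for illustration.

```python
from typing import Dict


def inline_as_template(sub_schedule: Dict[str, int], prefix: str) -> Dict[str, int]:
    """Turn a scheduled subgraph into a template: each inlined node keeps its
    cstep relative to the first cstep of the subgraph's schedule."""
    first = min(sub_schedule.values())
    return {f"{prefix}.{node}": cstep - first for node, cstep in sub_schedule.items()}


# Hypothetical schedule of a 3-cycle RAM write library function.
ram_write = {"addr": 0, "data": 1, "done": 2}
print(inline_as_template(ram_write, "w0"))
# {'w0.addr': 0, 'w0.data': 1, 'w0.done': 2}
```

The parent level then schedules the returned template as a single unit, while resource costs are still computed from the individual inlined nodes.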

6.0 Results 

Behavioral templates have been implemented in the Synopsys
Behavioral Compiler™ product. Behavioral Compiler™ inputs a
VHDL or Verilog behavioral description, performs scheduling,
allocation, module selection, binding, and control optimization, and
outputs an RTL design which is then optimized by RTL optimization
[2], FSM optimization, and logic synthesis.

We will use "dft", a discrete Fourier transform design, to illustrate
behavioral templates. On reset, dft sequentially reads in the real and
imaginary parts of the coefficients into arrays cmem and dmem.
These arrays are mapped to the memory "XRAM" (cmem in the
lower bank, dmem in the upper bank). It then enters the main processing
loop. In each iteration, dft signals it is ready for processing,
does a busy wait for the "start" signal, and then sequentially reads in
the real and imaginary parts of the data points into arrays amem and
bmem, which are also mapped to two halves of a memory,
"DRAM". It then enters two nested FOR loops which compute the
discrete Fourier transform values and write them out. The memories
are two-cycle RAM's whose read/write models are shown in Fig. 7.
The multiply operations are done on a 2-stage pipelined multiplier.

FIGURE 9. Handshaking for start signal: (a) original CDFG with
timing constraints, (b) final CDFG scheduled

Fig. 9 shows the CDFG fragment for the busy-wait on the "start"
signal. Fig. 9(a) shows the initial CDFG containing the fixed constraints
that the "ready" signal be asserted one cycle before the busy
wait, and deasserted 0 cycle after the busy wait. Fig. 9(b) shows the
same CDFG fragment that is finally scheduled. By this time, hierarchical
scheduling has already scheduled the busy wait loop, and the
loop body is inlined and locked in a template. Also, pre-chaining
has locked the constants with the write operations.

However, the main scheduling problem is in the inner-most loop,
which reads from memories the complex data (a, b) and coefficient
(c, d), and computes psum += (a*c - b*d) and ipsum += (a*d
+ b*c). Table 1 shows the scheduling and allocation results for the
computation part of this loop. Note the pipelined RAM reads are
chained with the pipelined multiply operations.

TABLE 1. Scheduling/allocation results for dft's inner loop

FIGURE 10. Scheduling/allocation result for computation part of dft's
inner-most loop

In Tables 2 and 3, we present some design statistics. #line is the number
of VHDL/Verilog lines in the source. #loop is the number of loops, with
nesting levels in brackets. #node is the number of CDFG nodes. #template
is the total number of templates scheduled. The ratio of nodes to
templates is shown in brackets. #RAM is the number of on-chip
memories used. #gate is the gate count after logic synthesis (excluding
RAM's). Note the difference between #node and #template.

Table 2 presents several HLSW benchmarks, modified to use more
realistic bit-widths. EWF is the fifth order elliptic wave filter example
(20 csteps for the main loop, using 1 16x16 pipelined multiplier,
2 32-bit adders, 10 32-bit registers and 1 16-bit register). KF is the
Kalman filter modified to use 5 RAM's.

TABLE 3. Design statistics for several industrial examples

Table 3 lists several industrial examples. Compared to benchmark
examples, these designs tend to:

• have more complicated reset sequences before the main processing
loop.

• have more cycle-by-cycle timing constraints on IO operations.

• have more logic operations and bit-manipulation operations.

• use multiple RAM's or multi-port RAM's to improve RAM
access bottlenecks.

• use pipelined operations to increase throughput.

7.0 Conclusion

In this paper, we have presented our work on scheduling using
behavioral templates. The most important value of behavioral templates
is that they enable simple solutions to the problems of (1)
enforcing fixed and maximum timing constraints, (2) modeling
complex sequential operations, (3) pre-chaining of logic and bit-manipulation
operations, and (4) hierarchical scheduling. For this
reason, behavioral templates have been instrumental in our productization
of behavioral synthesis.

Future work will investigate adding structural templates to partition
the design based on structural regularity.

Abstract

This paper describes an HDL synthesis based design methodology that
supports user adoption of behavioral-level synthesis into normal
design practices. The use of these techniques increases understanding
of the HDL descriptions before synthesis, and makes the comparison
of pre- and post-synthesis design behavior through simulation much
more direct. This increases user confidence that the specification does
what the user wants, i.e. that the synthesized design matches the specification
in the ways that are important to the user. At the same time,
the methodology gives the user a powerful set of tools to specify complex
interface timing, while preserving a user's ability to delegate
decision-making authority to software in those cases where the user
does not wish to restrict the options available to the synthesis algorithms.

1.0 Overview

This paper describes a synthesis methodology that uses high-level
synthesis (HLS) of behavioral hardware-description language (HDL)
descriptions. HLS has the distinguishing characteristic that operations
are automatically scheduled, i.e. assigned to states, as opposed to
lower-level synthesis, in which operations are assigned to states by the
user [1, 2, 3]. For example, in an HDL description of a square root
function, an operand x would be loaded, a series of operations would
follow, and a single result r would be returned. The read x and the
write r might be fixed to particular states or times by a communication
protocol, but the internal operations that compute the square root
would be automatically scheduled.

[...] a number of questions. These
will likely include the following:

• How can I constrain I/O operations to fall into particular cycles or
range of cycles, to meet existing protocols?

• How can I constrain I/O operations to have particular timing relationships?
For example, how can I constrain a data strobe to
be synchronous with data on data ports?

• How can I be confident that my interface timing specification
really works with the surrounding hardware?

• [...] when my timing specification is not rigid? For example, I
might not care exactly when data was transmitted, [...]
remains synchronous with [...]

• [...]

1. In the sense that it computes the right result,

2. In the sense that scheduling of I/O operations does not
'break' its I/O protocols.

These questions can be reformulated as requirements on the HDL
description methodology to be used in conjunction with HLS:

• The original HDL description should be simulatable.

• There should be a mode wherein the cycle by cycle I/O timing
of the original HDL description is preserved exactly, i.e.
no I/O timing difference will be allowed between the pre- and
post-synthesis descriptions. This will allow direct comparison,
on a cycle by cycle basis, of the pre- and post-synthesis
[...] cycle-based timing protocols.

• There should be a mode wherein timing relationships
between I/O signals can be simply and easily preserved
across synthesis, but where 'stretching' (cycle level delay
insertion) is permitted, so that the user does not have to say
how many cycles a computation will take. This
mode should allow manual constraints. Such a mode allows
comparison of pre- and post-synthesis I/O timing between
similar points of the pre- and post-synthesis waveforms.

• There should be a mode in which the user explicitly specifies
all timing constraints without reference to the simulation
[...]. The only timing constraints inferred
from the HDL description are ordering constraints among I/O
[...]. This mode gives the greatest flexibility
for optimization and for specification of complex
timing relationships; it is also the most difficult to use.
We call these three modes the cycle-fixed IO scheduling mode,
the superstate-fixed IO scheduling mode, and the free-floating IO
scheduling mode respectively. Each has consequences for the
style of HDL description and validation methodology. These
modes give the user a wide range of choices in specifying I/O
timing, with a corresponding range of ways in which validation
of the specification and comparison of the implementation with
the specification can be performed.

1.1 Structure of this paper

The balance of this paper is structured as follows. In Section 1.2,
related work in this field is discussed. Following that, in Section
2, some mode-independent considerations and assumptions are
described. In Section 3, the cycle-fixed mode is described in
detail. Then in Section 4, the superstate-fixed mode is described.
In Section 5, the free-floating mode is described. In Section 6,
experience with the current software is described; finally, in Section
7 the paper is summarized and conclusions are drawn.

1.2 Related Work 

High-level synthesis has been well described in the literature;
see, for example, Camposano [1], Gajski [2], Macrz [3]. These
tutorial papers describe the basics of HLS systems. CALLAS [4]
describes work in the area of maintaining simulated behavior that
is exactly the same pre- and post-synthesis; this idea is reflected
in the cycle-fixed mode described here. The superstate-fixed mode
is related to the High Level State machine of [5], and to the behavioral
finite state machines (BFSM's) of [6]. Our approach of validation
through simulation is typical of current industry practice; it
complements, but cannot completely replace, more formal methods
[7].
2.0 Basic assumptions 

The circuit to be synthesized by HLS consists of a collection of
always blocks (VHDL processes); each always block will be
mapped to hardware consisting of a datapath and a control FSM.
Each will be synthesized separately.

Control over timing makes use of clocking statements in the source
HDL. In Verilog, this can be done by use of @(posedge clock) or
@(negedge clock) statements¹. These are used to separate I/O
events that are to happen in different clock cycles. Event triggers
using other signals are specifically disallowed, with the exception
of asynchronous reset and a special gating methodology described
in Section 12, used for synchronizing I/O.

2.1 Reset 

In order to handle resets in an intuitively appealing way, we call
attention to the always block (VHDL process) that will be scheduled.
In our methodology this block contains a single all-encompassing,
nonterminating loop, here called reset_loop.

always begin: b1
  begin: reset_loop
    // reset sequence behaviors
    forever begin
      // normal mode behaviors
    end
  end
end

Inside reset_loop is a reset sequence; this consists of all behaviors
associated with reset. For example, in a microprocessor the reset
sequence would clear the program counter, disable interrupts, and
initialize the stack pointer. The reset sequence may contain many
clock cycles, e.g. to initialize a RAM. Following the reset behavior
is the 'normal mode' loop, which does not terminate either; this
loop contains behaviors that are executed until the next reset
occurs. In a microprocessor, for example, the normal mode loop
would be the fetch / execute cycle.

In order to simulate the effect of synchronous resets correctly in the
source HDL description, the user must insert a statement of the
form²

if (reset == 1'b1) disable reset_loop;

after every @posedge statement. This disable has the effect of
restarting the block (process) following a clock edge upon which
reset is found to be true. Simulation of synchronous resets can be
matched both pre- and post-synthesis.

Another capability can also be provided in which the user declares a
reset pin to the synthesis software, which then synthesizes the reset;
but because the reset behavior is not encoded in the HDL, resets
cannot be simulated correctly before synthesis using this technique.
Scheduling cannot handle exits triggered by a reset in the same way
as other exits, because there may be read-before-write accesses in
the HDL. Consider the following:

begin: reset_loop
  outport <= x; // x is read before write!
  begin: main_loop
    @(posedge clock);
    if (reset == 1'b1) disable reset_loop;
    x = v2;
  end
end

In this situation, the assignments of x cannot be rescheduled,
because this would change the observable
behavior of the circuit immediately following a reset pulse. If,
for example, the second write to x was rescheduled before the clock
edge, then the output immediately following a reset pulse would be
v2 in the scheduled design; but it would be v1 in the original
description. So if we are to allow read before write in the HDL, we
must either relax the requirement that all behaviors must be identical,
or we must forbid movement of such side effects across clock
boundaries. Side effects on variables that are always written before
they are read are not affected.

2.2 Registered outputs 

VHDL signals and Verilog reg variables behave like register or
latch outputs. That is, they hold their values once set. For implementation
reasons, we chose to register all outputs of HLS synthesized
designs; thus a nonblocking (signal) assignment becomes a
register write. This has the consequence that responses to external
events cannot happen until the cycle after the external event, as
shown in Fig. 1.

Figure 1 shows the behavior of a synthesized circuit where the HDL
input is of the general form

if (Ready == 1'b1) Data <= foo;
@(posedge clock);

This timing corresponds to both input and output. Notice that this
timing diagram implies that the control FSM for the synthesized
data path is a Mealy machine, and that the overall synthesized
design is a Moore machine.

[Figure 1 label: Clock]

Fig. 1. Response to an external event

1. In VHDL, "wait until clock'event and clock = '1'" gives us a
rising-edge clock.

2. In VHDL this would be "when reset = '1' exit reset_loop".

Here is an example combining an asynchronous reset and a compact
busy wait on a data strobe.

while (strobe != 1) begin
  @(posedge clock or posedge reset);
  if (reset == 1'b1) disable reset_loop;
end

3.0 Cycle-fixed mode

High-level synthesis in cycle-fixed mode can be described by the
following statement:

• Cycle-by-cycle I/O timing is identical between the pre- and
post-synthesis designs.

This means that validation by simulation is straightforward: a user
need merely simulate the pre- and post-synthesis designs side by
side, and check for differences in the outputs. Alternatively, the
synthesized design can be inserted into the original test bench without
modifying the test bench. The only differences that are visible
involve combinational delays in the form of setup and hold times;
for example, a delta-delay setup time would become a real setup
time, and a registered output pin will not transition exactly on the
clock edge, as it would in the pre-synthesis simulation¹. This is
shown in Fig. 2.

Notice that this mode only constrains the I/O operations of the
design. That is, the reads and nonblocking (signal) writes of the
HDL are tied to particular cycles. But this still leaves optimization
opportunities for the scheduling algorithm: other operations (e.g.
additions, memory operations, and register reads and writes) can be
shifted in time, as long as they consume data after it has been read
in, and produce data in time to write it out. The I/O operations provide
a series of 'stakes in the ground' that define time frames within
which all other operations are free to move.

[Figure 2 waveform labels: Clock, Strobe, Data]

Fig. 2a. Simulation of specified design (pre-synthesis)

Fig. 2b. Simulation of synthesized design (post-synthesis)

Fig. 2. Comparison of simulation in cycle-fixed mode.

The main advantage of cycle-fixed mode is that the user can synthesize
exactly the same timing diagram that the original HDL specification
shows in simulation; thus, if the simulated HDL specification
works in a particular context, then the synthesized design will also
work, assuming only that setup, hold, and propagation delays, etc.
as shown in Fig. 2b meet the clock cycle time.

A further advantage of cycle-fixed mode is that simulation of a
zero-gate-delay model of the synthesized design will match the
original specification exactly; hence a simple file difference program
can be used to compare pre- and post-synthesis designs. This
is expected to have a profound effect on user acceptance of HLS as
a viable tool in the design cycle: users are able to simply and efficiently
check the equivalence of designs before and after synthesis.

There are a number of methodological and implementation considerations
that affect the way we can write and implement cycle-fixed
mode. These will now be described.
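
The cycle-by-cycle comparison mentioned above can be sketched in Python. The trace format (one dictionary of port values per clock cycle) and the function name are assumptions made here; in practice the samples would come from the simulator's output files.

```python
from typing import Dict, List, Tuple

Trace = List[Dict[str, int]]  # one {port: value} sample per clock cycle


def compare_traces(pre: Trace, post: Trace) -> List[Tuple[int, str, object, object]]:
    """Return (cycle, port, pre_value, post_value) for every mismatch."""
    mismatches = []
    for cycle in range(max(len(pre), len(post))):
        a = pre[cycle] if cycle < len(pre) else {}
        b = post[cycle] if cycle < len(post) else {}
        for port in sorted(set(a) | set(b)):
            if a.get(port) != b.get(port):
                mismatches.append((cycle, port, a.get(port), b.get(port)))
    return mismatches


pre = [{"strobe": 0, "data": 0}, {"strobe": 1, "data": 7}]
post_ok = [{"strobe": 0, "data": 0}, {"strobe": 1, "data": 7}]
post_bad = [{"strobe": 0, "data": 0}, {"strobe": 0, "data": 7}]
print(compare_traces(pre, post_ok))   # []
print(compare_traces(pre, post_bad))  # [(1, 'strobe', 1, 0)]
```

An empty result corresponds to the "no differences" outcome of a file difference program; any entry pinpoints the first cycle and port at which the synthesized design departs from the specification.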

3.1 Numbers of clock edges 

One consequence of the commitment to maintain exact I/O equivalence
in cycle-fixed mode is that numbers of clock edges cannot be
varied inside the scope of loops and conditionals. To do so would
distort the I/O timing of the design.



1. In zero-delay simulation one should ensure that data transitions
occur slightly after clock transitions; failing to do this is the most
common source of simulation mismatches. The problem comes
about because of varying numbers of simulation-cycle delays in the
clock and data wires of the circuit: the clock can arrive 'after' the
data by an infinitesimal (zero-time) amount. This causes something
analogous to a setup-time violation.



3.2 Loop boundaries 

Every loop of an always block must contain at least one clock edge
statement. The only exception to this is loops with constant iteration
bounds, which can be unrolled during synthesis.
A loop can be thought of as a subgraph of a finite-state machine
(FSM) which forms a cycle. The synthesized design will enter this
cycle when the loop is executed, and leave it when the loop is
exited. Such a loop is shown in Fig. 3.
o1 <= v1;
while (c) begin: loop
    o2 <= v2;
    @(posedge clock);
end
o3 <= v3;
@(posedge clock);

[State graph: transition labeled !c / v1, v3]

Fig. 3. Loop and corresponding state graph



The loop of Fig. 3 corresponds to the state labeled 'Loop'. During
each pass of the loop, the value of v2 will be written to the output
port o2.

The main consequence of matching this behavior is the splitting of
the conditional test c. Notice that it was necessary, in order to
capture the timing of the original, to have a state transition that
bypassed the loop altogether if c was false when it was first tested.
This means that the test must be performed in two places: once in
state prev, and once in state loop. In general, it is necessary to unroll
the first state of the first pass through a while loop in order to
capture this behavior correctly.

If we wish to avoid unrolling the first pass, then it is necessary to
rewrite the loop so that (1) there is a clock edge on all paths
between the writes of o1 and o3, and (2) there is a clock edge
between the conditional test and any succeeding I/O, as shown in
Fig. 4.

o1 <= v1;
while (c) begin: loop
    @(posedge clock);
    o2 <= v2;
end
@(posedge clock);
o3 <= v3;

Fig. 4. Loop that does not need partial unrolling.

3.3 Conditional multicycle operations 

A multicycle operation is one that has a longer combinational delay
than the clock cycle. This imposes special constraints on synthesis
in cycle-fixed mode, because it is necessary to stabilize all data and
control inputs to the hardware block that implements the multicycle
operation. This includes all the control inputs of all multiplexers
that drive multicycle operations; clearly we cannot afford glitches
on these paths.

But inserting these registers means that we need to know what to
strobe into the registers one cycle before the multicycle operation
is to begin. Thus we need to add extra time, under some circumstances,
so that the stabilizing registers can be properly loaded.
This is illustrated in Fig. 5; we assume 



many times, with varying numbers of clock edge statements each
time, looking for the best implementation.



@(posedge clock);
if (input_signal == 1'b1) begin
    x = input_read1;
    y = input_read2;
    tmp = x + y;        // 2 cycle addition
    @(posedge clock);   // strobe stab regs
    @(posedge clock);   // 1st cycle of add
    @(posedge clock);   // 2nd cycle of add
    out <= tmp;
end
@(posedge clock);

Fig. 5. HDL description for a multicycle addition.

Notice that we needed three clock cycles to do this properly: one
to get the condition and strobe the stabilizing registers, and two to
perform the multicycle addition. Notice also that such delays can
often be hidden, where the multicycle operations are not constrained
by I/O; but that in this case there is no opportunity to
hide the additional delay associated with stabilizing the inputs.

3.4 Loop pipelining in cycle-fixed mode 

Loop pipelining is a technique whereby a loop can be made to act
like a pipeline. Thus the loop has a relatively long latency, i.e. the
time from a data input to the corresponding data output; and a
shorter initiation interval, which is the rate at which data can be
delivered to and read out from the loop. In cycle-fixed mode, and
with some extra constraints in the other modes, a simple way to
imply loop pipelining while maintaining timing equivalence is to
use a delayed assignment (in VHDL, a transport delay) on the
output statement. Suppose, for example, we have a loop whose
latency is ten cycles, but whose initiation interval is two cycles;
we can put an output write after the second clock edge statement,
with a delay of eight cycles. This will simulate the same way both
before and after synthesis.

while (condition) begin
    @(posedge clock);   // 10 ns clock
    @(posedge clock);
    out <= #80 value;   // delayed by 8 cycles
end
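
The arithmetic behind the delay value in the fragment above can be sketched as follows. This is only an illustration in Python, not part of the HDL methodology; the function name and its parameters are hypothetical. It assumes the delayed write is placed after the last clock edge statement of the loop body, so that the delay equals the latency minus the initiation interval, converted to time units by the clock period.

```python
def pipelined_write_delay(latency_cycles, initiation_interval, clock_period_ns):
    """Return (delay_cycles, delay_ns) for the delayed output write.

    Hypothetical helper: the write sits after `initiation_interval` clock
    edges, so the remaining latency is expressed as an assignment delay.
    """
    if initiation_interval < 1:
        raise ValueError("initiation interval must be at least one cycle")
    if latency_cycles < initiation_interval:
        raise ValueError("latency cannot be shorter than the initiation interval")
    delay_cycles = latency_cycles - initiation_interval
    return delay_cycles, delay_cycles * clock_period_ns


# Values taken from the text: latency 10 cycles, initiation interval 2, 10 ns clock.
delay_cycles, delay_ns = pipelined_write_delay(10, 2, 10)
print(f"out <= #{delay_ns} value;  // delayed by {delay_cycles} cycles")
```

With the values from the text, the helper reproduces the #80 delay of eight cycles used in the fragment.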

4.0 Superstate-fixed Mode 

The superstate-fixed I/O mode is used where the I/O should
inherit its general structure from the HDL, but where there is
some freedom to shift I/O operations in time. Consider, for example,
the two-wire handshaking protocol shown in Fig. 6.
The two-wire protocol is insensitive to the time between transitions;
this makes it ideal for many applications. In a case like this,
the only things we really need to assure in order to have correct
timing are that (1) the signal transitions occur in the right order,
and (2) that the transitions of Strobe and Data maintain a lockstep
relationship. Beyond that, the user might not care very much how
many clock cycles were inserted by scheduling; other design
optimization criteria (such as the number of gates to compute the data
value) might dictate more or fewer clock cycles for this transaction.
The cycle-fixed mode is unsuitable for this kind of loosened
specification of timing: the user could be forced to edit the code



[Figure: timing diagram with traces Request, Strobe, and Data]

Fig. 6. Two-wire handshaking protocol.

The superstate-fixed I/O scheduling mode can be expressed by the fol- 
lowing statements: 

• Adjacent pairs of clock edge statements in the HDL form the
boundaries of superstates.

• All I/O operations in a superstate remain in that superstate.

• A superstate may be expanded by the scheduler, which can add
clock cycles to lengthen a superstate.

• All I/O writes in a superstate will always take place in the last
clock cycle of the superstate.

• I/O reads may float within a superstate.




Fig. 7a. Simulation before superstate-fixed scheduling.

[Figure: superstates ss1, ss2, ss3]

Fig. 7b. Simulation after superstate-fixed scheduling.
These rules, taken together, mean that an HDL scheduled in superstate
mode will show the same signal transitions and ordering as the original
HDL; but that the original timing may potentially be 'stretched' by
the addition of new clock edges. This is illustrated in Fig. 7, where the
original HDL simulation of an I/O transfer taking three cycles has
become five cycles long by the addition of two extra cycles to the
second superstate.
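
The stretching described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the scheduler itself: the function name and the list-based representation of superstates are hypothetical, each superstate of the source HDL is taken to be one cycle long, and writes are placed in the last cycle of their superstate as the rules above require.

```python
def stretch(base_cycles, added_cycles):
    """Return (total_cycles, write_cycles), with cycles numbered from 1.

    base_cycles[i]  : length of superstate i in the source HDL
    added_cycles[i] : cycles added to superstate i by the scheduler
                      (hypothetical input for this sketch)
    """
    if len(base_cycles) != len(added_cycles):
        raise ValueError("one entry per superstate is required")
    total = 0
    write_cycles = []
    for base, extra in zip(base_cycles, added_cycles):
        if base < 1 or extra < 0:
            raise ValueError("superstates may only be lengthened")
        total += base + extra
        write_cycles.append(total)  # writes occur in the last cycle
    return total, write_cycles


before = stretch([1, 1, 1], [0, 0, 0])  # three superstates, unstretched
after = stretch([1, 1, 1], [0, 2, 0])   # two cycles added to the second
print(before)
print(after)
```

In this model the transfer grows from three cycles to five, and the writes keep their relative order; only their absolute cycle numbers change.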

4.1 Protocols in superstate mode

One of the major advantages of superstate mode is that handshaking
I/O protocols are not distorted by the addition of clock cycles to
superstates. This has two beneficial consequences: first, comparison of
simulated pre- and post-synthesis designs is straightforward; and second,
protocols that are insensitive to increased numbers of clock cycles will
not be 'broken' by superstate scheduling. Hence if a design consists of
many processes, each of which is to be scheduled, the use of handshaking
communication in conjunction with superstate mode scheduling
will ensure that the design will continue to work after synthesis.
The same considerations apply to the simulation test bench as well:
the test bench must communicate with the synthesized design(s) via
handshaking protocols; otherwise it may have to be modified to
communicate successfully with the synthesized design. This happens
because the read and write operations occur at different times pre- and
post-synthesis; the test bench must be able to tolerate this, or the user
will have to retime the test bench.

Protocols that do not involve explicit requests and acknowledges can
still be used; but care must be taken with data to be read in by the
synthesized process. In particular, recall that read operations may move
freely within their superstate. This means that data being presented
to the synthesized circuit must be either valid during the entire
superstate in which it is read, or else retimed after scheduling. This
will ensure that the read operation always gets the correct data.

4.2 Constraints in superstate mode

The reason a designer would use superstate mode instead of cycle-fixed
mode is that some part of the schedule does not have a fixed
timing bound, and the user does not want to imply such a bound by
using cycle-fixed I/O. However, the user may have a non-handshaking
protocol, or a protocol that streams data once synchronization
has been established by the protocol. In such cases the parts of the
schedule that perform synchronization may need to be handled as if
the scheduler was in cycle-fixed mode; while the other parts of the
design can be allowed more freedom. For example, consider the
fragment

while (ready == 1'b0) begin: handshaking_loop
    @(posedge clock);
end
@(posedge clock);
a1 = in_port;           // label read_1
@(posedge clock);
a2 = in_port;           // label read_2
@(posedge clock);
out_port <= long_involved_function(a1, a2);
out_ready <= 1'b1;      // label done
@(posedge clock);

Here the external logic provides the data for read_1 and read_2 in
the two cycles after the signal ready goes true; the synthesized system
must pick it up then, or the protocol will be broken. Furthermore,
insertion of extra cycles in the loop handshaking_loop will
cause the interface to behave unpredictably. Thus cycle-fixed mode
would seem to be indicated. However, suppose that there is no need
for the output to show up until 20 cycles after the input has been
delivered; the designer will thus want to allow the scheduler authority
to add cycles to the last superstate, and rely on a test of the
out_ready pin to synchronize the data on out_port. Thus stretching can
be allowed in the last superstate, but not in the first three.
This can be done by means of explicit point-to-point scheduling
constraints; that is, constraints that tie two labeled operations
together in a particular timing relationship. A constraint set that
would serve the purpose is

1. The time from the beginning of handshaking_loop to its end
should be exactly one cycle.

2. The time from the end of handshaking_loop to the beginning of
read_1 should be exactly one cycle.

3. The time from the end of handshaking_loop to the data ready
strobe done is no greater than 21 cycles.

Notice that these constraints are not part of the HDL; but they are a
necessary part of the methodology. They can be implemented as
pseudocomments, as attributes, or as directives in a separate scheduler
command file. Notice also that they can be applied to non-I/O
operations as well, in all three modes, to give the user a little extra
control over the scheduling process.
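
A point-to-point constraint set of this kind can be checked against a candidate schedule with a short Python sketch. Everything here is an assumption made for illustration: the Constraint record, the label names loop_begin and loop_end, and the cycle numbers in the two schedules are hypothetical and do not reflect any actual scheduler command-file syntax.

```python
from typing import NamedTuple, Optional


class Constraint(NamedTuple):
    src: str                   # label of the earlier operation
    dst: str                   # label of the later operation
    min_cycles: int            # smallest allowed distance in cycles
    max_cycles: Optional[int]  # largest allowed distance, None if unbounded


def violations(schedule, constraints):
    """Return the constraints that the schedule (label -> cycle) breaks."""
    broken = []
    for c in constraints:
        distance = schedule[c.dst] - schedule[c.src]
        too_short = distance < c.min_cycles
        too_long = c.max_cycles is not None and distance > c.max_cycles
        if too_short or too_long:
            broken.append(c)
    return broken


# The three constraints listed above, with hypothetical labels.
constraints = [
    Constraint("loop_begin", "loop_end", 1, 1),
    Constraint("loop_end", "read_1", 1, 1),
    Constraint("loop_end", "done", 0, 21),
]

ok_schedule = {"loop_begin": 0, "loop_end": 1, "read_1": 2, "done": 20}
late_schedule = {"loop_begin": 0, "loop_end": 1, "read_1": 2, "done": 30}

print(violations(ok_schedule, constraints))
print(violations(late_schedule, constraints))
```

The first schedule satisfies all three constraints; the second stretches the last superstate too far and breaks only the bound on done.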

4.3 Superstate HDL methodology

Superstate mode defines superstates as containing the I/O operations
that fall between adjacent pairs of clock edge statements. This
definition has the consequence that sometimes an HDL prepared for
superstate mode needs clock edge statements that are not needed in
cycle-fixed mode. For example, the text of Fig. 3 is ambiguous
when the HDL is considered as input for superstate mode. This
comes about because two writes are separated by a conditional
@posedge. If the loop condition is true, then the writes should be in
different superstates; if it is false, then they should be in the same
superstate. Clearly there is no unique static assignment of I/O
operations to superstates in this situation.

Furthermore, there is an implicit ordering of operations conferred
by the sequencing of the HDL text; this ordering cannot be allowed
to come into conflict with the ordering conferred by the migration
of reads into any cycle of their superstate and writes into the last
cycle of their superstate.

The HDL methodology rules that prevent ambiguities and
contradictions in superstate mode are:

1. A superstate that contains a loop continue is called a continuing
superstate. Implicitly, the last superstate of a loop is also a
continuing superstate. A continuing superstate and the first superstate
of the loop are really the same superstate; there is no clock
statement on the execution path going from one to the other. If a
continuing superstate contains a write, then the first state of the
loop cannot contain any I/O, because a write belonging to the
continuing superstate would be migrated to the end of the first
loop superstate; this would result in a violation of the HDL's
ordering constraints.

2. A superstate that contains a loop beginning cannot include both
an I/O write before the loop beginning and any I/O operation
inside the loop. For example,

@(posedge clock);
out_port <= write_1_data;
while (cond) begin
    read_data = in_port;    // Illegal!
    @(posedge clock);
end

the write in this fragment conflicts with the read in the beginning
of the loop; they are in the same superstate.

3. A write cannot precede a while loop that is succeeded by any
I/O operation, unless there is a clock edge statement between
either the write and the loop begin, or between the loop end and
the second I/O operation.

4. A loop having a superstate in which both a loop exit¹ and an I/O
write are located must have a clock edge statement between the
loop end and the next I/O operation.

5. A conditional clock edge (e.g. an @edge on one branch of a
conditional) cannot be used to separate a write from another I/O
operation. This fragment is illegal for that reason.

out_port <= v1;
if (cond) @(posedge clock);
v2 = in_port;

5.0 Free-floating I/O mode 

It will sometimes be the case that a user will need to convey more
freedom to the scheduler than is allowed by the superstate I/O
mode. For example, the user may wish to allow two unrelated
writes to be permuted. Consider the fragment of Fig. 8.
In this situation, the user might not care whether the first or the
second function happens first; indeed, they could be interleaved and
the user might not care. But neither superstate nor cycle-fixed mode
will permit permutation of I/O operations and waits; so a more
powerful mode is needed.



1. Other than a reset exit. Reset exits can be ignored after a
preprocessing step in which they are detected and global reset behavior is
enacted, as explained in Section 2.

a1 = in_port1;
a2 = in_port2;
@(posedge clock);
out_port_1 <= long_function_1(a1, a2);
@(posedge clock);
b1 = in_port3;
b2 = in_port4;
@(posedge clock);
out_port_2 <= long_function_2(b1, b2);

Fig. 8. Writes to out_port_1 and out_port_2 may be permuted.

The free-floating mode is characterized by implicit constraints on
single I/O ports and explicit user constraints.
Implicit I/O port constraints are derived directly from the HDL text
and are imposed on the sets of reads and writes that occur on a single
port. These are formed into partially ordered sets, one for each
port, where the ordering is derived from a static execution trace
analysis of the source HDL. The schedule constructed by synthesis
can only transpose two members of one of these sets if there is no
ordering relationship between them.
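
The per-port rule can be sketched as a simple legality check in Python. This is an illustrative model only: the function name, the port named cfg_port, its two writes, and all cycle numbers are hypothetical, and the sketch assumes that an ordered pair on one port must be scheduled in strictly increasing cycles.

```python
def respects_port_orders(schedule, port_orders):
    """Check a schedule (operation -> cycle) against per-port orderings.

    port_orders maps each port to a list of (earlier, later) pairs, one
    pair for every ordering relationship derived for that port.
    """
    for ordered_pairs in port_orders.values():
        for earlier, later in ordered_pairs:
            if schedule[earlier] >= schedule[later]:
                return False  # an ordered pair was transposed
    return True


# out_port_1 and out_port_2 each carry a single write, so neither port
# contributes an ordering; cfg_port is a hypothetical port with two
# ordered writes.
port_orders = {
    "out_port_1": [],
    "out_port_2": [],
    "cfg_port": [("cfg_write_a", "cfg_write_b")],
}

source_order = {"write_1": 1, "write_2": 3, "cfg_write_a": 0, "cfg_write_b": 2}
permuted = {"write_1": 3, "write_2": 1, "cfg_write_a": 0, "cfg_write_b": 2}
reversed_cfg = {"write_1": 1, "write_2": 3, "cfg_write_a": 2, "cfg_write_b": 0}

print(respects_port_orders(source_order, port_orders))
print(respects_port_orders(permuted, port_orders))
print(respects_port_orders(reversed_cfg, port_orders))
```

Swapping the writes on the two different output ports passes the check, while reversing the two ordered writes on the shared port does not.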

This, however, says nothing about ordering of reads and writes that
occur on different ports, which must be explicitly constrained by
the user, by means of the explicit two-point constraints described in
Section 4.2.

For example, in our experience a common early mistake in free-floating
mode is to expect a data strobe's timing to be fixed with
respect to that of the data being strobed. This will not necessarily be
the case if the user does not issue explicit constraints.
The downside of this mode is the number of explicit constraints that
the user must construct. This can easily be comparable in numbers
of lines to the HDL input itself. In addition, it is very easy to get
such constraints wrong, or to forget a crucial constraint; hence the
cycle-fixed and superstate modes are simpler and less error-prone to
use.

6.0 Experience

Support for the methodologies discussed above has been built into a
commercial product, the Synopsys Behavioral Compiler(TM). This
product is currently in use at a number of sites. Of these, about half
use Verilog as their input HDL; the rest use VHDL.
Experience to date indicates that the superstate mode is usually the
most convenient from the standpoint of ease of specification of
complex timing behaviors. The next most convenient is usually the
cycle-fixed mode. The reason for this is that the power of the
free-floating mode comes at the price of manually added constraints,
while the cycle-fixed mode requires the user to add clock cycles to
the source HDL when, e.g., the duration of a particular loop is to be
changed.

From the standpoint of ease of validation of results, the cycle-fixed
mode is usually a little more convenient than the superstate mode.
This is because the handshaking protocols necessary to get the
design talking to the test bench after superstate-mode scheduling
must be designed and written in both the test bench and the
specification; or alternatively the test bench timing must be modified to
match the schedule of I/O of the post-synthesis design.
One area in which the free-floating mode seems to be more convenient
than the others is in that of exploration. Here the user is more
interested in getting a rough idea of the cost and speed of a design
or algorithm, than in getting its interfaces exactly right. In this
context the ease of turning the design around and the high degree of
freedom from methodological constraints makes it simpler to
change the design and resynthesize to see what the overall results
are. Then when the general outlines of the algorithms,
representations, etc. are clear the user can begin to worry about the detailed
I/O timing.

The overall effort of getting I/O interfaces right using these three
modes is usually less than the effort spent in getting the best possible
quality of results. Even with behavioral synthesis, HDL writing
styles still can have a large impact on the quality of the synthesized
circuit. Examples that can affect synthesis quality are: loop ordering,
assignment of variables and arrays to memories, choice of loop
pipeline initiation intervals and latencies, pipelined components,
embedding combinational logic in reusable function blocks, the
tradeoff between multicycle operations and fast clock rates, and the
partitioning of the design into datapath/controller subunits (i.e.
always blocks; in VHDL, processes). All are potentially of great
importance to the quality of results, and all represent true engineering
decisions that must be carefully considered if a really good
design is to be achieved.

7.0 Conclusion

We have presented HDL methodologies for the synthesis of various
kinds of I/O timing and protocols, and for simulation-based validation
of the synthesized design against the original specification.
Three modes of scheduling I/O operations have been presented:

1. Cycle-fixed, in which the design has exactly the same cycle-level
I/O timing before and after synthesis;

2. Superstate-fixed, in which I/O operations are grouped by pairs
of @posedge statements; post-synthesis timing behavior is a
(potentially) stretched version of the pre-synthesis timing; and

3. Free-floating, in which the only constraints on I/O scheduling
are either between operations sharing a port or supplied by the
user.

Some of the implications of the scheduling modes were described.
In the cycle-fixed and superstate modes, these involve the placement
of clock edge statements, loop boundaries, conditionals, and
I/O operations; while in the free-floating mode there are no rules of
this kind.

Experience with production software which implements these
methodologies has been described, and conclusions based on that
experience have been drawn.

August 18, 2000

VIA FEDERAL EXPRESS

Mr. David W. Knapp 

Chief Technical Officer 

Get2Chip.com, Inc. 

2107 North First Street, Suite 350 

San Jose, CA 95131 

Re: Reissue of U.S. Patent No. 5,764,951 to Ly et al. for METHODS FOR
AUTOMATICALLY PIPELINING LOOPS
Our Ref: 4000/10

Dear Mr. Knapp: 



This is a follow up to our letter of 31 July 2000 requesting your review and signature for
the above-identified reissue patent application. As we stated in our earlier letter, this law firm
represents your former employer Synopsys, Inc. (hereinafter "Synopsys"), which is currently
seeking reissuance of the above-identified patent in the United States Patent and Trademark
Office. Enclosed please find a Declaration of Inventor which we will be filing in the Reissue
Patent Application. Also enclosed is a copy of the Reissue Application, which adds new claims
23-34 to those which already issued in U.S. Patent 5,764,951. Please review the Declaration and
the new claims carefully. The Declaration and Reissue Application are identical to those sent to
you on 31 July. We have enclosed additional copies for your convenience.

Assuming you understand and agree with the Declaration, Synopsys requests that you do
the following. Please complete the Declaration by entering, in the boxes at the end of the
Declaration, your country of citizenship and the address of your current home residence. Once
you have completed the Declaration, please sign the Declaration in the indicated box at the end
of the Declaration. Return the Declaration to us by means of the enclosed self-addressed
stamped envelope.

IT IS VERY IMPORTANT THAT YOU RETURN THE DECLARATION TO US
BY SEPTEMBER 1, 2000. IF YOU HAVE ANY QUESTIONS ABOUT THE
DECLARATION, IT IS EXTREMELY IMPORTANT THAT YOU CONTACT US AS
SOON AS POSSIBLE SO THAT WE MAY ANSWER YOUR QUESTIONS.

As I am sure you will recall, you have previously assigned all rights in the invention to 
Synopsys as your former employer. 

Your cooperation in this matter is greatly appreciated. On behalf of Synopsys, Inc., I
thank you most sincerely for your help.

Very truly yours, 



Jonathan T. Kaplan 
JTK:MJM:od 

Enclosures: Declaration of Inventor (w/ self-addressed stamped envelope)
            Reissue Application

Application Serial No. 09/590,584
In re Patent No. 5,764,951
Patentees: Ly et al.
Attorney Docket No. 4000/10

IN THE UNITED STATES PATENT AND TRADEMARK OFFICE



Reissue Appl. No. : 09/590,584 

Filed : June 8, 2000 

In re Patent to : Tai A. Ly et al. 

Patent No. : 5,764,951 

Issued : June 9, 1998 

Title : METHODS FOR AUTOMATICALLY PIPELINING LOOPS 
Box REISSUE 



Assistant Commissioner for Patents 
Washington, D.C. 20231 

DECLARATION OF INVENTOR DAVID W. KNAPP IN APPLICATION 
FOR BROADENING REISSUE OF PATENT 
Pursuant to 37 CFR 1.63 and 1.175

This declaration is made in application for broadening reissue of the above-identified 

patent. 

As a below named inventor, I hereby declare that: 

My residence, post office address, and citizenship are as stated below next to my name. 

I believe I am an original, first and joint inventor of the subject matter which is claimed and for 
which a patent is sought on the invention entitled 

METHODS FOR AUTOMATICALLY PIPELINING LOOPS

the specification of which: (check one)
[ ] is attached hereto; or

[X] was filed on June 8, 2000 as U.S. Reissue Application Serial No. 09/590,584.

I hereby state that I have reviewed and understand the contents of the above identified 
specification, including the claims, as amended by any amendment referred to above. 

I acknowledge the duty to disclose information which is material to patentability as defined in Title
37, Code of Federal Regulations, §1.56.

I believe the original patent to be wholly or partly inoperative or invalid, for the reasons described 
below: 

[ ] by reason of a defective specification or drawing.

[X] by reason of the patentee claiming more or less than he had the right to claim in the patent.
[ ] by reason of other errors.

At least one error upon which this application for reissue is based is described as follows: 

The limitation of claim 1 to methods comprising steps of parsing text descriptions including loops 
with delayed signal assignments having delay values and setting latencies of pipelines equal to said 
delay values is more limiting than necessary, and resulted in the patentee claiming less than he had 
a right to claim. 

The limitation of claim 18 to systems comprising logic for parsing text descriptions including loops
with delayed signal assignments having delay values and setting latencies of pipelines equal to said
delay values is more limiting than necessary, and resulted in the patentee claiming less than he had
a right to claim.

The limitation of claim 21 to computer program products comprising computer readable program
code devices configured to cause a computer to effect parsing of text descriptions including loops
with delayed signal assignments having delay values and setting of latencies of pipelines equal to
said delay values is more limiting than necessary, and resulted in the patentee claiming less than he
had a right to claim.

All errors being corrected in this reissue application arose without any deceptive intention on the
part of the applicant.

I hereby declare that all statements made herein of my own knowledge are true and that all
statements made on information and belief are believed to be true; and further that these
statements were made with the knowledge that willful false statements and the like so made are
punishable by fine or imprisonment, or both, under Section 1001 of Title 18 of the United States
Code and that such willful false statements may jeopardize the validity of the application or any
patent issued thereon.



Full Name of First Joint Inventor David W. Knapp 


Inventor's Signature 




Date 


Residence 




Citizenship 


Post Office Address

METHODS FOR AUTOMATICALLY PIPELINING LOOPS

Related Applications 

This application is related to U.S. patent application Ser. No. 08/440,101 entitled 
"Behavioral Synthesis Links to Logic synthesis" with inventors Ronald A. Miller, 
Donald B. MacMillen, Tai A. Ly and David W. Knapp filed on May 12, 1995, which is 
hereby incorporated by reference. 

Background 
Field of the Invention 

This invention relates to the field of computer aided design for digital circuits, 
particularly to automatically pipelining loops in a behavioral synthesis system. 

Statement of the Related Art 

Behavioral Synthesis 

Behavioral vs. Register Transfer Level Design 

Many of today's integrated circuits are described using a Hardware Description 
Language (HDL). Two common HDL's are VHDL and Verilog. VHDL is described in 
the IEEE Standard VHDL Language Reference Manual available from the Institute of 
Electrical and Electronic Engineers in Piscataway, New Jersey which is hereby 
incorporated by reference. Verilog is described in The Verilog Hardware Description
Language by Donald E. Thomas and Philip Moorby, Kluwer Academic Publishers,
1991 which is hereby incorporated by reference.

As integrated circuits become increasingly complex, hardware designers are 
increasingly using synthesis software to transform HDL descriptions of digital circuits 
into mapped logic. The designer writes a description of a digital circuit in VHDL, 
Verilog, or another HDL, and uses synthesis software to create a digital circuit from the 
description. Using synthesis software typically shortens the amount of time required to 
create a digital circuit from a design specification, and allows a designer to create more 
complex designs than is possible manually. 

Many of today's complex designs are expressed as software descriptions and
simulated to verify their correctness. These designs are later translated from software
into hardware, in the form of Integrated Circuits (ICs), Application Specific Integrated
Circuits (ASICs), or Field Programmable Gate Arrays (FPGAs), for implementation in
the final product. This design description methodology is called algorithmic-level
design.

Instead of beginning design at the Register Transfer Level (RTL), behavioral
synthesis begins at the algorithmic (behavioral) level. RTL level design is described in
Computer Structures: Readings and Examples by C. Gordon Bell and Allen Newell,
McGraw-Hill 1971. A behavioral hardware description language (HDL) specification
contains instructions, operations, variables, and arrays similar to the original software
algorithm.

The target architecture of behavioral synthesis is a general computing model that
contains datapath, memory, and control elements. Conventional design techniques
currently use a manual RTL design methodology to build a datapath. A datapath is a
sequence of logic consisting of registers, higher order functional units (such as adders
and multipliers), and multiplexers. The datapath in a digital circuit uses the circuit's
inputs to compute output results. Registers are 1-bit memory elements which hold their
value through each clock cycle.

Conventional design techniques also build a controller at the RTL to sequence 
and control the actions of the datapath, memory, and Input/Output (I/O). Frequently, 
such controllers are implemented using a Finite State Machine (FSM). Finite state 
machines are described in Switching and Finite Automata Theory by Zvi Kohavi, 
Computer Science Press, 1978 which is hereby incorporated by reference. Controllers 
may also determine actions such as which branch of a conditional statement is executed. 

Behavioral synthesis builds this architecture by using automated methods of
scheduling, allocation, register sharing, memory and control inferencing, all of which
are performed manually in an RTL methodology. The designer is freed from having to
specify the exact architecture of a design and can automatically explore many
implementations to find the optimal architecture.

Components of Behavioral Synthesis

The High-Level Synthesis of Digital Systems by Michael McFarland, Alice
Parker, and Raul Camposano, in Proceedings of the IEEE, February 1990, which is
hereby incorporated by reference, provides an excellent overview of High Level
Synthesis, as Behavioral Synthesis is often called.

Three components of a behavioral synthesis system are Scheduling, Allocation, 
and Resource Sharing. 

Scheduling determines in which clock cycle each operation executes. 
Scheduling extracts the control and data flow operations of a design specification and 
assigns these operations to cycles. A state machine controller is synthesized to sequence 
the operations and execute them in their assigned cycle. The typical goal of this process 
is to assign operations to cycles so as to be able to implement the design with the fewest 
resources (registers, multiplexers, and operations) while at the same time minimizing 
the number of clock cycles (latency). 

Allocation is a behavioral synthesis task that maps the operations and data of a 
behavioral HDL specification into the datapath, which contains memories, registers, 
functional units such as adders and multiplexers, and gates. Allocation determines 
which type of operation to use for each operator. For instance, if an operator performs 
addition, a ripple carry, a carry-lookahead, or some other type of adder can be used.

Resource Sharing attempts to share hardware resources between operators in a 
design. For example, consider two additions which occur in mutually exclusive 
conditional branches. Such additions will never be performed at the same time. Thus, 
they can be performed on the same piece of hardware. Resource sharing attempts to 
minimize the amount of hardware used by sharing hardware as much as possible. 
Scheduling Modes 

There are several modes for automatically scheduling operations into control 
steps. Briefly, these modes are cycle-fixed, superstate-fixed, and free-floating mode. In
cycle-fixed mode, all I/O operations are constrained to occur in the same cycle in the 
original HDL descriptions and in the synthesized design. In cycle-fixed mode, the cycle 
level behavior of the synthesized circuit must match the cycle level simulation behavior
of the source HDL. 

The other scheduling modes allow behavioral synthesis a greater degree of 
freedom in assigning states in a schedule. Scheduling modes are discussed further in 
Behavioral Synthesis Methodology for HDL-Based Specification and Validation by D.
Knapp, T. Ly, D. MacMillen and R. Miller in Proceedings of the 31st DAC, June 1995,
which is included as Appendix B and is hereby incorporated by reference. They are also
discussed in Behavioral Compiler User Guide Version 3.2a available from Synopsys,
Inc. in Mountain View, Calif., which is hereby incorporated by reference.
Loop Pipelining

In behavioral HDL, a loop repeatedly executes the operations in the loop body 
until an exit condition becomes true. Loop iterations are usually sequential; operations 
in the first iteration are executed, operations in the next iteration are executed, and so 
on, as shown in Figure 1. The throughput, that is the amount of data processed per unit 
time, of the function implemented by the loop body is limited by the critical path in the
loop body. 

In some loops, data required by an operation in the next loop iteration is 
available prior to completion of the current loop. Under these conditions, the designer 
can pipeline the loop, parallelizing execution of iterations to increase throughput
beyond critical path limitations of the loop body. This process of loop pipelining
schedules consecutive loop iterations to partially overlap in time; a new loop iteration is
initiated before the current iteration has finished.

Figure 2 shows an example of loop pipelining where the data required by 
operation A in iteration two is available after operation C in the first loop iteration. 
The two timing-related aspects of a loop that affect throughput are:

Initiation interval: The number of clock cycles between the start of two 
consecutive loop iterations. 

Latency: The number of clock cycles required to execute all operations in a 
single loop iteration. 
For sequential loops that are not pipelined, the initiation interval and latency of a
loop are the same. For a pipelined loop, the initiation interval is smaller than the 
latency. 
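
The relationship between initiation interval, latency, and throughput can be illustrated with a small calculation. The following Python sketch is not part of the described system; it assumes a loop with a known iteration count in which each new iteration starts exactly one initiation interval after the previous one. The function name total_cycles is hypothetical.

```python
def total_cycles(iterations: int, latency: int, initiation_interval: int) -> int:
    """Clock cycles needed to complete `iterations` iterations of a loop.

    Assumes every iteration takes `latency` cycles and a new iteration
    starts `initiation_interval` cycles after the previous one started.
    """
    if iterations <= 0:
        return 0
    if not 1 <= initiation_interval <= latency:
        raise ValueError("initiation interval must be between 1 and the latency")
    # The last iteration starts after (iterations - 1) intervals and then
    # needs one full latency to finish.
    return latency + (iterations - 1) * initiation_interval


# Sequential loop: initiation interval equals latency.
sequential = total_cycles(10, latency=4, initiation_interval=4)
# Pipelined loop: a new iteration starts every 2 cycles.
pipelined = total_cycles(10, latency=4, initiation_interval=2)
```

Under these assumptions the sequential loop needs 40 cycles for ten iterations and the pipelined loop needs 22, which is the throughput gain that loop pipelining is intended to provide.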

The primary reason for using loop pipelining is to increase the throughput of the 
design; the trade-off is that the design area usually increases. 

Many designs have separate specifications on throughput and input-to-output 
delay. The throughput specification constrains the initiation interval. The input-to- 
output delay specification constrains the loop latency. Loop pipelining enables a 
flexible relationship between the initiation interval and latency of a loop. 

An example of a candidate for loop pipelining is a design that processes a data 
stream. This type of design often has tight throughput requirements based on the rate of 
the data streams and loose input-to-output delay constraints. 
Loop Carry Dependencies 

Loop Carry Dependencies (LCDs) are data values produced in one iteration of a 
loop and consumed by operations in subsequent iterations. 

In loop pipelining, loop iterations that are producers and consumers of LCDs 
can happen at the same time. To preserve data dependencies, the operations in a loop 
must be scheduled so that LCD values are available in time for the iteration in which 
they will be consumed. Two schedules for an LCD are shown in Figure 3.

The example of Figure 3(a) violates the LCD. Operation 410 is scheduled so that 
its output is not ready in time for operation 420 to use it in the next iteration of the loop. 
The example of Figure 3(b) is scheduled correctly. In this case, operation 410 is 
scheduled so that its output is ready in time for operation 420 to execute in the next 
iteration of the loop. 
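
The check that distinguishes the two schedules can be sketched in Python. This is an illustrative model only: it assumes control steps are numbered from the start of each iteration, that each operation occupies a single control step, and that the producer must finish strictly before the consumer executes. The function name and parameters are hypothetical.

```python
def lcd_satisfied(producer_step: int, consumer_step: int,
                  initiation_interval: int, distance: int = 1) -> bool:
    """Return True if a loop carry dependency is respected.

    producer_step / consumer_step: control steps within one iteration.
    distance: number of iterations between producing and consuming the value.
    """
    # The consuming iteration starts `distance` initiation intervals later,
    # so the consumer runs at this step measured from the producing iteration.
    consumer_absolute = consumer_step + distance * initiation_interval
    return producer_step < consumer_absolute


# Producer scheduled too late for an initiation interval of 2.
violating = lcd_satisfied(producer_step=3, consumer_step=0, initiation_interval=2)
# Producer scheduled early enough.
correct = lcd_satisfied(producer_step=1, consumer_step=0, initiation_interval=2)
```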
Memory and I/O Accesses 

Loop Pipelining must preserve the original ordering of all reads and writes to the 
same memory, signal, or port. In addition, reads and writes in one iteration
of the loop may not "cross," or occur after, reads and writes in subsequent iterations of
the loop. Specifically, all reads and writes to the same memory, and all writes to the
same signal or port in one iteration of the loop must occur before any reads or writes to
the same memory, signal or port in a subsequent iteration of the loop. All reads of the 
same signal or port must occur simultaneously to or before any read of the same signal 
or port in a subsequent iteration of the loop. 

For example, Figure 4 shows two schedules for a loop that has two reads of 
signal X. In Figure 4(a), read 510 and read 520 are improperly scheduled. Read 520
occurs after read 510 occurs in the next iteration of the loop. In Figure 4(b), read 510
and read 520 are properly scheduled. In this schedule, read 520 occurs before read 510 in
the next iteration of the loop. 
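
The ordering rule can be expressed as a comparison between the last access in one iteration and the first access in the next. The Python sketch below is illustrative only; it assumes control steps are numbered from the start of an iteration and that consecutive iterations start one initiation interval apart. The function name and the reads_only flag are hypothetical.

```python
def accesses_do_not_cross(steps: list[int], initiation_interval: int,
                          reads_only: bool = False) -> bool:
    """Check one memory, signal, or port for crossing accesses.

    steps: control steps, within one iteration, of every access to the resource.
    reads_only: True when the resource is a signal or port that is only read,
                in which case simultaneous accesses are allowed.
    """
    first, last = min(steps), max(steps)
    # First access of the next iteration, measured from the current iteration.
    next_first = first + initiation_interval
    return last <= next_first if reads_only else last < next_first


# Two reads of a signal at steps 0 and 3 with an initiation interval of 2:
# the second read falls after the next iteration's first read.
crossing = accesses_do_not_cross([0, 3], 2, reads_only=True)
# Reads at steps 0 and 2 coincide with the next iteration's first read.
simultaneous = accesses_do_not_cross([0, 2], 2, reads_only=True)
```

Checking only the next iteration is sufficient in this model, because later iterations start even further in the future.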
A Brief Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of 
this specification, illustrate several embodiments of the invention and, together with the 
description, serve to explain the principles of the invention. 
Figure 1 shows an example of sequential loop processing.

Figure 2 shows an example of pipelined loop processing including the loop 
latency and initiation interval. 

Figure 3 shows an example of a loop carry dependency. 

Figure 4 shows an example of memory and I/O access restrictions in pipelined
loops.

Figure 5 is a block diagram showing a computer system. 
Figure 6 is a flowchart which shows steps in a circuit synthesis process. 
Figure 7 is a flowchart which shows steps for scheduling preprocessing. 
Figure 8 is a flowchart which shows steps for inserting constraints into a 
constraint graph.

Figure 9 is a flowchart which shows steps for scheduling templates. 
Figure 10 is a flowchart which shows steps for creating a constraint using 
templates. 

Figure 11 shows HDL source code which contains a loop with a producer and a
consumer.

Figure 12 shows a circuit before scheduling which is created from loop 3030 of 
Figure 11. 

Figure 13 shows a constraint created for a producer and consumer in loop 3030. 
Figure 14 shows a circuit which is created after scheduling loop 3030 using an 
initiation interval of 2 and a latency of 4.

Figure 15 shows Verilog HDL source code which contains a loop with I/O 
dependencies. 

Figure 16 shows a circuit before scheduling which is created from loop 1530 of 
Figure 15. 
Figure 17 shows a constraint created for two reads in loop 1530.
Figure 18 shows a circuit which is created after scheduling loop 1530 using an 
initiation interval of 2 and a latency of 4. 

Figure 19 (a) and Figure 19 (b) are examples of HDL source code including a 
delay clause.

Figure 20 is a flowchart showing steps performed during translation from the
source code of Figure 19 (a) and Figure 19 (b) to a circuit design that incorporates a 
delay specified by the delay clause. 

Figure 21 is a representation of a data flow graph generated from the source 
code of Figure 19 (a) and (b) in accordance with the steps of Figure 20.

Figure 22 is a representation of a control flow graph generated from the source 
code of Figure 19 (a) and (b) and the data flow graph of Figure 21. 

Figure 23 is a flow chart showing steps performed to generate a control data 
flow graph from the control flow graph and data flow graph of Figure 21 and Figure 22.

Figure 24 is a representation of a control data flow graph generated by the steps
of Figure 23.

Figure 25 is a diagram showing an example of loop tiling with and without the 
delay in the HDL. 

Figure 26 is a diagram showing the effect of the delay clause on pipelining. 
Figure 27 shows the operations of Figure 12 scheduled into control steps.

Figure 28 shows the read operations of Figure 16 scheduled into control steps. 
Detailed Description of the Invention

The present invention is a method and apparatus for synthesizing a circuit which 
implements a pipelined loop from a Hardware Description Language (HDL) 
description. The following description is presented to enable any person skilled in the 
art to make and use the invention, and is provided in the context of a particular
application and its requirements. Various modifications to the preferred embodiment
will be readily apparent to those skilled in the art, and the generic principles defined
herein may be applied to other embodiments and applications without departing from
the spirit and scope of the invention. Thus, the present invention is not intended to be
limited to the embodiment shown, but is to be accorded the widest scope consistent with
the principles and features disclosed herein.
1.0 Computer System Description 

Figure 5 illustrates a computer system 100 in accordance with a preferred 
embodiment of the present invention. The computer system 100 includes a bus 101, or
other communications hardware and software, for communicating information, and a
processor 109, coupled with the bus 101, for processing information. The processor
109 can be a single processor or a number of individual processors that can work
together. The computer system 100 further includes a memory 104. The memory 104
can be random access memory (RAM), or some other dynamic storage device. The
memory 104 is coupled to the bus 101 and is for storing information and instructions to
be executed by the processor 109. The memory 104 also may be used for storing
temporary variables or other intermediate information during the execution of
instructions by the processor 109. The computer system 100 also includes a ROM 106
(read only memory), and/or some other static storage device, coupled to the bus 101.
The ROM 106 is for storing static information such as instructions or data.

The computer system 100 can optionally include a data storage device 107, such 
as a magnetic disk, a digital tape system, or an optical disk and a corresponding disk 
drive. The data storage device 107 can be coupled to the bus 101. 

The computer system 100 can also include a display device 121 for displaying 
information to a user. The display device 121 can be coupled to the bus 101. The
display device 121 can include a frame buffer, specialized graphics rendering devices, a 
cathode ray tube (CRT), and/or a flat panel display. The bus 101 can include a separate 
bus for use by the display device 121 alone. 

An input device 122, including alphanumeric and other keys, is typically
coupled to the bus 101 for communicating information, such as command selections, to
the processor 109 from a user. Another type of user input device is a cursor control 123,
such as a mouse, a trackball, a pen, a touch screen, a touch pad, a digital tablet, or
cursor direction keys, for communicating direction information to the processor 109,
and for controlling the cursor's movement on the display device 121. The cursor control
123 typically has two degrees of freedom, a first axis (e.g., x) and a second axis (e.g.,
y), which allows the cursor control 123 to specify positions in a plane. However, the 
computer system 100 is not limited to input devices with only two degrees of freedom. 

Another device which may be optionally coupled to the bus 101 is a hard copy
device 124 which may be used for printing instructions, data, or other information, on a 
medium such as paper, film, slides, or other types of media. 

A sound recording and/or playback device 125 can optionally be coupled to the 
bus 101. For example, the sound recording and/or playback device 125 can include an 
audio digitizer coupled to a microphone for recording sounds. Further, the sound 
recording and/or playback device 125 may include speakers which are coupled to a
digital-to-analog (D/A) converter and an amplifier for playing back sounds.

A video input/output device 126 can optionally be coupled to the bus 101. The 
video input/output device 126 can be used to digitize video images from, for example, a 
television signal, a video cassette recorder, and/or a video camera. The video 
input/output device 126 can include a scanner for scanning printed images. The video 
input/output device 126 can generate a video signal for display by, for example, a
television.

Also, the computer system 100 can be part of a computer network (for example, 
a LAN) using an optional network connector 127 coupled to the bus 101. In one
embodiment of the invention, an entire network can then also be considered to be part
of the computer system 100. 

An optional device 128 can be coupled to the bus 101. The optional
device 128 can include, for example, a PCMCIA card and a PCMCIA adapter. The
optional device 128 can further include a device such as a modem or a wireless
network connection.
2.0 Definitions

A digital circuit is an interconnected collection of parts. Parts may also be called 
cells. The digital circuit receives signals from external sources at points called primary 
inputs. The digital circuit produces signals for external destinations at points called 
primary outputs. Primary inputs and primary outputs are also called ports. Each part 
receives input signals and computes output signals. Each part has one or more pins for 
receiving input signals and producing output signals. In general, pins have a direction. 
Most pins are either input pins, which are called loads, or output pins, which are called 
drivers. Some pins may be bidirectional pins, which can be both drivers and loads. 

Two or more pins from one or more parts or primary inputs or primary outputs 
are connected together with a net. Each net establishes an electrical connection among 
the connected pins, and allows the parts to interact electrically with each other. Pins are 
also connected to primary inputs and primary outputs with nets. For the sake of 
simplicity, parts may be said to be "connected" to nets, but it is actually pins on the 
parts which are connected to the nets.

A Circuit Element is any component of a circuit. Ports, pins, nets, and cells are 
all circuit elements. Any circuit element which is an input to another circuit element is 
said to drive that circuit element. Any circuit element which is an output of another 
circuit element is said to load that circuit element. For example, drivers drive a signal 
onto a net; loads load nets with capacitance. 

A digital circuit design can be stored in memory of a computer system using 
data structures which represent the various components of the circuit. The data 
structures have the same name as the physical components. In this document, parts, 
cells, nets, pins, and other digital circuit components refer to the software representation
of the physical digital circuit component. 

A digital circuit can be specified hierarchically. Some or all of the parts in the 
digital circuit may themselves be digital circuits composed of more interconnected 
parts. When a high level part is specified as a digital circuit composed of other, lower
level parts, the pins of the high level part become the primary inputs and primary 
outputs for the digital circuit comprising the lower level parts. When a high level part is 
composed of lower level parts, it is called a level of hierarchy. 

Following are additional definitions of terms which are used in this document. 
An HDL is a Hardware Description Language. HDLs are used to describe
designs for digital circuits.

A Translated Circuit, Generic Technology Circuit, or GTech Circuit is a 
software representation of a digital circuit which does not include references to a 
specific technology, but rather refers to cells that implement generic logic such as 
"and", "or", and "not". This software representation is stored in memory 104 of
computer system 100. 

A Mapped Circuit is a software representation of a digital circuit which is built 
from parts available in a technology library which is provided by a silicon vendor. This
software representation is stored in memory 104 of computer system 100. A mapped
circuit can be timed using a conventional timing verifier such as DesignTime, available
from Synopsys, Inc. in Mountain View, Calif. After it is built, a netlist representation of
a mapped circuit can be sent to a silicon vendor for layout and fabrication. For instance,
the mapped circuit can be written out using LSI netlist format and sent to LSI Logic in
Milpitas, Calif. The process of creating a mapped circuit from a generic technology
circuit is called mapping. Because a circuit must be mapped before it can be timed,
mapped circuits are also used internally by synthesis tools. 

The Fanout of a circuit element includes any circuit elements which are driven 
by that circuit element. The transitive fanout of a circuit element includes all of the 
circuit elements in the circuit which are driven, either directly or indirectly, by that 
circuit element. Thus, the transitive fanout of a circuit element includes the fanout of
that circuit element, as well as the fanout of each of the circuit elements in the original
fanout, and so on.

The Fanin of a circuit element includes any circuit elements which drive that 
circuit element. The transitive fanin of a circuit element includes all of the circuit
elements in the circuit which drive, either directly or indirectly, that circuit element. 
Thus, the transitive fanin of a circuit element includes the fanin of that circuit element, 
as well as the fanin of each of the circuit elements in the original fanin, and so on. 
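
Transitive fanout and transitive fanin amount to reachability in the graph of driver relationships. The following Python sketch illustrates this under the assumption that the circuit is stored as a dictionary mapping each circuit element to the elements it directly drives; the dictionary layout, the element names, and the function names are hypothetical and are not the data structures of the described system.

```python
from collections import deque


def transitive_fanout(drives: dict, start: str) -> set:
    """All circuit elements driven, directly or indirectly, by `start`."""
    seen = set()
    queue = deque(drives.get(start, ()))
    while queue:
        element = queue.popleft()
        if element in seen:
            continue
        seen.add(element)
        # Continue through everything this element drives in turn.
        queue.extend(drives.get(element, ()))
    return seen


def transitive_fanin(drives: dict, start: str) -> set:
    """All circuit elements that drive `start`, directly or indirectly."""
    # Reverse every edge, then reuse the fanout traversal.
    driven_by = {}
    for source, targets in drives.items():
        for target in targets:
            driven_by.setdefault(target, []).append(source)
    return transitive_fanout(driven_by, start)


# Hypothetical circuit: ports a and b drive net n1, which drives an adder, and so on.
circuit = {"a": ["n1"], "b": ["n1"], "n1": ["add"], "add": ["n2"], "n2": ["out"]}
fanout_of_a = transitive_fanout(circuit, "a")
fanin_of_add = transitive_fanin(circuit, "add")
```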

An Operator is a function, such as addition. Such functions are used in HDL 
source code. For example, the plus in "c=a+b;" is an operator.

An Operation is a software representation of a hardware functional unit which 
performs a function such as addition. For example, a software representation of an 
adder is an operation. 

A Clock Cycle is a period of time, for example 10 ns, between pulses of a
clocking element in a digital circuit. The clocking element is used to synchronize the
digital circuit. 
3.0 Scheduling 

Scheduling is a well defined problem which has been studied extensively. An 
overview of the scheduling problem is available in The High-Level Synthesis of Digital
Systems by Michael McFarland, Alice Parker, and Raul Camposano, in Proceedings of
the IEEE, February 1990, which is hereby incorporated by reference. 

The input to a scheduler is typically a set of hardware operations, a set of 
constraints between the hardware operations, a clock period, and a set of control steps 
into which the hardware operations must be mapped. The output is a schedule where 
each hardware operation is mapped to a control step.

Schedulers typically use a number of graphs. For instance, the constraints for a 
scheduler are often represented using a graph. Nodes in the graph typically represent 
events to be scheduled, such as operations, and edges in the graph represent constraints 
between the events. The scheduler checks the constraint graph to ensure that all of the 
constraints are met before placing an event into a particular control step. Schedulers
also use control graphs, data flow graphs, and combination control data flow graphs
(CDFG's). Control graphs represent the flow of control in a circuit. Data flow graphs
represent the flow of data in a circuit; that is, the flow of data from the inputs to the
outputs of the circuit. Control data flow graphs combine both control flow and data flow
information into a single graph. All of these types of graphs are described in High-Level
Synthesis (subtitled Introduction to Chip and System Design) by Daniel Gajski, Nikil
Dutt, Allen C-H Wu, and Steve Y-L Lin, Kluwer Academic Publishers, 1992, which is
hereby incorporated by reference and will subsequently be referred to as High-Level
Synthesis by Gajski et al.

An additional technique used for scheduling circuits involves "templates". 
Templates are described in Scheduling using Behavioral Templates by Tai Ly, David 
Knapp, Ron Miller, and Don MacMillen in Proceedings of the 31st DAC, June 1995, 
which is included as Appendix A and is hereby incorporated by reference. Simply
speaking, templates are data structures which specify scheduling constraints among
CDFG nodes. Templates "lock" the control step relationship between 2 or more CDFG 
nodes. Figure 13 shows an example of two templates, template 1250 and template 1280. 
Each template contains one or more nodes, some of which may represent operations. 
For example, adder node 2020 represents adder 3120 of Figure 12. 

3.1 Overview of Synthesis with Scheduling

Figure 6 is a flowchart showing how scheduling steps fit into the overall 
synthesis strategy. This flowchart shows how a mapped circuit is created from a source 
HDL description. The input to synthesis is an HDL description of a digital circuit. Such
a description may be written in VHDL, Verilog, or some other HDL.

An HDL description is translated in step 810 to generic logic. A conventional
HDL translator 1310 such as VHDL Compiler version 3.2b from Synopsys, Inc. in
Mountain View, Calif., preferably is used.

Step 820 performs scheduling preprocessing steps. These steps are shown in 
Figure 7 and Figure 8. 
Step 830 schedules the operations in the circuit. A method for scheduling the
operations in the circuit is shown in Figure 9. 

Step 840 netlists the scheduled circuit. Netlisting creates a GTech circuit from 
the scheduled CDFG. The CDFG representation of the circuit in memory is transformed 
into a GTech representation of the circuit in memory.

In step 850, the resulting GTech circuit is optimized using conventional logic 
synthesis such as Design Compiler version 3.2b by Synopsys, Inc. in Mountain View,
Calif. The output of logic optimization is a mapped circuit description which can be
sent to a silicon vendor for fabrication. For example, a description of the mapped circuit
can be output using LSI Netlist format and sent to LSI Logic in Milpitas, Calif., for
fabrication. 

3.2 Scheduling Preprocessing 

Figure 7 is a flowchart which shows steps for scheduling preprocessing. The 
input to the method is an annotated GTech circuit. Annotations on the circuit include 
delayed signal assignment information. The use of delayed signal assignments will be
discussed in a later section. 

Step 910 extracts a control graph from the annotated GTech using conventional 
techniques. In addition, information concerning delayed signal assignments is extracted
as described below. 

Step 920 extracts a Control Data Flow Graph (CDFG) from the control graph
created in step 910 and the data flow graph represented by the GTech circuit. This is
also done using conventional techniques. 

Step 930 creates initial templates for the operations in the CDFG as described in 
Scheduling Using Behavioral Templates in Appendix A. These initial templates form
the initial constraint graph.

Step 940 inserts constraints in the constraint graph. Some types of constraints 
are discussed in Scheduling Using Behavioral Templates in Appendix A. Other types of 
constraints are a part of the present invention and will be discussed in subsequent 
sections. 
3.3 Inserting Constraints

Step 940 of Figure 7 is implemented by the method of Figure 8, a flowchart which
shows steps for inserting constraints into a constraint graph which uses templates. The
input to the process is a CDFG and a constraint graph. 

Step 1110 identifies Loop Carry Dependency (LCD) producer consumer pairs.
LCD's are identified by tracing the CDFG using conventional techniques. LCD's are 
discussed below in connection with Figure 11, Figure 12, Figure 13, Figure 14, and 
Figure 27. 

Step 1120 constrains the LCD's. Constraining LCD's involves adding constraints
to the constraint graph so that producer and consumer operations are scheduled so that 
the consumer consumes a value produced by the producer before it is overwritten in a 
subsequent iteration of the loop. A method and apparatus for constraining LCD's will be 
discussed in a later section. 

Step 1130 identifies memory and I/O access dependencies in loops which will
be scheduled using pipelines. I/O accesses include reads and writes to memories, 
signals, and ports. Reads and writes in one iteration of the loop may not "cross," or 
occur after, reads and writes in subsequent iterations of the loop. Specifically, all reads 
and writes to the same memory, and all reads and writes to the same signal or port in 
one iteration of the loop must occur before any reads or writes to the same memory, 
signal or port in a subsequent iteration of the loop. The one exception to this rule is that 
reads of the same signal or port may occur simultaneously to a read of the same signal 
or port in a subsequent iteration of the loop. This step finds the first and last accesses for 
each memory, signal, or port by tracing through the CDFG using conventional 
techniques. Memory and I/O accesses are discussed below in connection with Figure 
15, Figure 16, Figure 17, Figure 18, and Figure 28. 

Step 1140 constrains the memory and I/O accesses in pipelined loops.
Constraining memory and I/O accesses involves adding constraints to the constraint 
graph so that first and last accesses are scheduled so that the last access occurs before 
the first access in a subsequent iteration of the loop. A method and apparatus for 
constraining memory and I/O accesses will be discussed in a later section.

Step 1150 inserts other types of constraints into the constraint graph. Such
constraints are discussed in Scheduling Using Behavioral Templates in Appendix A. An 
example of another type of constraint is a dataflow constraint, which ensures that data 
values are produced before they are consumed by subsequent operations.
3.4 Scheduling Templates 

Figure 9 is a flowchart which shows steps of scheduling (step 830 of Figure 6) 
using templates. The input to the process is the CDFG and the constraint graph created 
by the steps of Figure 8. It is possible to schedule templates using many different
scheduling techniques. A number of scheduling techniques are described in High-Level
Synthesis by Gajski et al., particularly in Chapter 7. This figure shows a general method,
which is provided as an example. 

Step 1010 creates the As Soon As Possible (ASAP) and As Late As Possible 
(ALAP) schedules for each template while satisfying the constraints represented in the 
constraint graph. The ASAP schedule places each template into the earliest possible
control step (c-step). The ALAP schedule places each template into the latest possible 
control step. Together, the earliest and latest control steps define a range into which 
each template may be scheduled. A method for determining the ASAP and ALAP 
schedules for templates is described in Scheduling Using Behavioral Templates in 
Appendix A.
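
The idea of an ASAP/ALAP range can be illustrated with a simplified Python sketch. It is not the template-based method of Appendix A: it assumes every template holds a single operation that takes one control step, that the only constraints are precedence constraints, and that operations are listed in topological order. The function name and arguments are hypothetical.

```python
def asap_alap(ops: list, preds: dict, num_steps: int):
    """Earliest and latest control step for each operation.

    ops: operation names in topological order.
    preds: operation -> operations that must complete in an earlier step.
    num_steps: number of control steps available (numbered from 0).
    """
    asap = {}
    for op in ops:
        # One step after the latest predecessor, or step 0 if there is none.
        asap[op] = max((asap[p] + 1 for p in preds.get(op, ())), default=0)

    # Build the successor relation for the backward (ALAP) pass.
    succs = {}
    for op, before in preds.items():
        for p in before:
            succs.setdefault(p, []).append(op)

    alap = {}
    for op in reversed(ops):
        # One step before the earliest successor, or the last step if none.
        alap[op] = min((alap[s] - 1 for s in succs.get(op, ())),
                       default=num_steps - 1)

    if any(asap[op] > alap[op] for op in ops):
        raise ValueError("not enough control steps for these constraints")
    return asap, alap


# Hypothetical example: c needs a and b, and d needs c, with 4 control steps.
asap, alap = asap_alap(["a", "b", "c", "d"], {"c": ["a", "b"], "d": ["c"]}, 4)
```

In this example each operation has one step of slack: operation a may be placed in control step 0 or 1, and operation d in control step 2 or 3.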

Loop 1020 loops until a "good" schedule is found. A "good" schedule is one 
which fulfills the constraints specified in the constraint graph and optimizes for a 
specific goal specified by a human designer, such as fewest number of control steps. 
Different scheduling techniques use different criteria for deciding when to stop trying to 
improve the schedule. For example, one technique might stop when the constraints are
all met, or when a certain amount of CPU time has been spent, whichever comes last. 

Step 1030 picks a template in the constraint graph to schedule. Different 
techniques use different criteria for deciding what to schedule next. Generally, template 
scheduling techniques use criteria based upon the operations in a template. For instance,
a list scheduling technique which uses priorities will assign a priority to a template
based on the priorities of the operations within the template. (List scheduling is
described in High-Level Synthesis by Gajski et al. in Chapter 7.)

Step 1040 schedules the chosen template in the control step chosen by the 
scheduling technique being used. Templates are scheduled by placing the first operation 
within the template into the chosen control step and the remaining operations within the 
template into subsequent control steps as defined by the template. 

Arrow 1050 indicates that loop 1020 iterates until a "good" schedule is found. 

4.0 Method for Creating Constraints 

This section describes a general technique for constraining the relationship 
between two nodes in a constraint graph. Such constraints are added in step 940 of 
Figure 7. The section then describes examples of using this technique to constrain loop 
carry dependencies and I/O dependencies. 

4.1 Placeholder Node Method 

Figure 10 shows a general method for creating a scheduling constraint between 
two nodes in a constraint graph. Such constraints are created in step 1120 and step 1140
of Figure 8 to constrain LCD's and memory and I/O accesses. This section shows a
general method and discusses specific examples. The first example constrains an LCD;
the second example constrains a pair of signal reads. The input to the process of Figure
10 is a constraint graph, two templates in the graph, Event 1 and Event 2, an integer n,
and a number of cycles c. "n" is the number of cycles within which Event 2 must be
scheduled after Event 1. "c" is either 0 or 1. "c" has value 1 when Event 2 must be
scheduled before n cycles after Event 1, and value 0 when Event 2 may be scheduled
exactly n cycles after Event 1.

Step 610 adds a placeholder node H to the template for Event 1 in the constraint 
graph. A placeholder node is a node in the constraint graph which is only used to create
constraints. The placeholder node does not represent any portion of the final circuit.
Placeholder node H is inserted into Event 1's template such that it is locked n cycles
after Event 1. 
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Step 620 adds a constraint in the constraint graph from Event 2 to placeholder 
node H which constrains Event 2 to occur c cycles before placeholder node H, where c 
is 0 or 1. The value of c depends on the constraint being added and will be discussed in 
subsequent sections. 
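
The two steps of Figure 10 can be sketched as follows. This is an illustration under stated assumptions, not the described embodiment: the `ConstraintGraph` class, the `add_placeholder_constraint` and `satisfies` functions, and the encoding of a constraint as a tuple `(a, b, c)` meaning "a occurs at least c cycles before b" are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ConstraintGraph:
    # template_offsets[event] maps node name -> fixed offset from that event
    template_offsets: dict = field(default_factory=dict)
    # each constraint (a, b, c) means: a must occur at least c cycles before b
    constraints: list = field(default_factory=list)


def add_placeholder_constraint(graph, event1, event2, n, c):
    """Lock placeholder H n cycles after event1 (step 610), then constrain
    event2 to occur at least c cycles before H (step 620)."""
    if c not in (0, 1):
        raise ValueError("c must be 0 or 1")
    placeholder = f"H_{event1}"
    template = graph.template_offsets.setdefault(event1, {event1: 0})
    template[placeholder] = n                           # step 610
    graph.constraints.append((event2, placeholder, c))  # step 620
    return placeholder


def satisfies(graph, cstep):
    """Check a schedule (dict: event -> control step) against all constraints."""
    placed = dict(cstep)
    for event, offsets in graph.template_offsets.items():
        for node, off in offsets.items():
            placed[node] = cstep[event] + off
    return all(placed[a] + c <= placed[b] for a, b, c in graph.constraints)
```

With n = 2 and c = 1, a schedule that places Event 1 in control step 0 and Event 2 in control step 1 satisfies the constraint, while placing Event 2 in control step 2 does not.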

4.2 Using Placeholder Nodes for Loop Carry Dependencies 

The following section provides an example of constraining loop carry 
dependencies using placeholder nodes. Such constraints are created in step 1120 of 
Figure 8. A loop carry dependency is a data value which is produced in one iteration of 
a loop and consumed by operations in subsequent iterations of the loop. To use the 
placeholder node method to schedule loop carry dependencies, Event 1 is set to be the 
operation which consumes the data. Event 2 is set to be the operation which produces 
that data. Event 2 must be scheduled so that the correct data values are driving it when it 
feeds its outputs to Event 1. If the consumer (Event 1) consumes the data one iteration 
after the producer (Event 2) creates it, then n is set to be the initiation interval of the 
loop. If the consumer consumes the data k iterations after it is created by the producer, 
then n is set to be k * initiation interval. For LCDs, "c" has value "1" because the 
producer must be scheduled before the consumer in the subsequent iteration of the loop. 
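
The choice of n and c for a loop carry dependency can be written out as a short worked example. The function names `lcd_constraint_params` and `producer_is_legal` are hypothetical, and control steps are assumed to be plain integers measured within one iteration of the loop.

```python
def lcd_constraint_params(initiation_interval, k=1):
    """Return (n, c) for a loop carry dependency consumed k iterations later."""
    if initiation_interval < 1 or k < 1:
        raise ValueError("initiation interval and k must be positive")
    n = k * initiation_interval  # placeholder is locked n cycles after the consumer
    c = 1                        # producer must be strictly before the placeholder
    return n, c


def producer_is_legal(consumer_cstep, producer_cstep, initiation_interval, k=1):
    """True when the producer is at least c cycles before the placeholder."""
    n, c = lcd_constraint_params(initiation_interval, k)
    return producer_cstep + c <= consumer_cstep + n
```

For an initiation interval of 2 and a consumer in control step 0, the placeholder sits in control step 2, so the producer may be placed in control step 1 but not in control step 2.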

Figure 11 shows an example of Verilog source code for a loop 3030 with a loop 
carry dependency between addition 3020 and subtraction 3010. The output of addition 
3020, p, drives the input of subtraction 3010 on the next iteration of the loop. "p" is a 
Loop Carry Dependency. In this example, a human designer has specified that loop 
3030 will be scheduled using an initiation interval of 2 and a latency of 4. Although this 
loop would not usually be pipelined because pipelining does not increase its throughput, 
this simple example is used for the sake of clarity. 

Figure 12 shows a GTech circuit representation 2000 which is created for loop 
3030 in Figure 11. The GTech circuit representation is stored in memory 104. GTech 
circuit 2000 is output from step 810 of Figure 6. Addition 3020 is implemented as adder 
3120, and subtraction 3010 is implemented as subtracter 3110. Port p 2040 drives 
subtracter 3110. Port p' 2045 is driven by adder 3120. Port p 2040 and port p' 2045 are 
partner ports. Partner ports are ports which represent the same signal, and thus 
frequently embody loop carry dependencies. Partner ports contain references to their 
partners. In the described embodiment, these references are implemented as pointers. 
Each port which has a partner contains a pointer to its partner port. 

Figure 13 shows a constraint 1270 between adder node 2020, which is the 
producer for this LCD, and subtracter node 2010, which is the consumer of this LCD. 
The consumer and producer were identified in step 1110 of Figure 8. This constraint is 
created using the method of Figure 10. The starting templates are shown in Figure 
13(a). First, step 610 of Figure 10 adds placeholder node H 2060 to the template 1250 of 
subtracter node 2010. Because the initiation interval for the loop is 2, placeholder node 
H 2060 is constrained to be 2 cycles after subtracter node 2010 by template 1250. Next, 
step 620 creates constraint 1270, represented by an arrow, which constrains adder node 
2020 to be at least one cycle before placeholder node H 2060. The modified templates 
and the new constraint are shown in Figure 13(b). The new constraint is then used to 
schedule the loop correctly using a method such as the one shown in Figure 9. 

Figure 27 shows the add and subtract operations of Figure 12 scheduled into 
control steps by step 830 of Figure 6. For the sake of clarity, the other operations in the 
circuit are not shown. Two iterations of the loop are shown, to demonstrate how the 
schedule properly handles the loop carry dependency. Adder 3120 is scheduled so that 
its result is available before subtracter 3110 needs it in the next iteration of the loop. 

Figure 14 shows the circuit created from the Verilog HDL source code of Figure 
11 after scheduling. Block 3190 is the representation of the FSM controller for 
this circuit, stored in memory 104. 

4.3 Using Placeholder Nodes for I/O Dependencies 

Loop pipelining must preserve the original order of all reads and writes to the 
same memory, signal, or port. The placeholder node method can be used to create 
constraints which ensure that I/O accesses in different iterations of the loop do not cross 
one another. Such constraints are created in step 1140 of Figure 8. The last I/O access to 
the same memory, signal, or port in a loop must occur simultaneously with or before the 
first I/O access to that memory, signal or port in the next iteration of the loop. 
Specifically, reads of the same signal or port may occur simultaneously with reads in 
the next iteration of the loop, but not after. Writes to the same signal or port must occur 
before any read or write to the same signal or port in the next iteration of the loop. 
Reads and writes to the same memory must occur before any read or write to the same 
memory in the next iteration of the loop. 

Thus, any last I/O access must occur within the initiation interval of the first I/O 
or memory access. To create this constraint, Event 1 of Figure 10 is set to be the first 
I/O access to a given memory, signal or port. Event 2 of Figure 10 is set to be the last 
I/O access to a given memory, signal or port. n is set to be the initiation interval of the 
loop, and c is set to be 0 or 1. Specifically, c is set to be 0 if Event 1 and Event 2 are 
signal or port reads; c is set to be 1 if Event 1 or Event 2 is a signal or port write, or a 
memory read or write. 
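
The selection of n and c for an I/O dependency can be sketched as a small function. The name `io_constraint_params` and the encoding of an access as a `(resource_kind, direction)` pair are assumptions made for illustration only.

```python
def io_constraint_params(first_access, last_access, initiation_interval):
    """Return (n, c) for the first and last access to one memory, signal, or port.

    Each access is a (resource_kind, direction) pair, e.g. ("signal", "read").
    """
    def is_signal_or_port_read(access):
        kind, direction = access
        return kind in ("signal", "port") and direction == "read"

    both_reads = (is_signal_or_port_read(first_access)
                  and is_signal_or_port_read(last_access))
    c = 0 if both_reads else 1  # only two signal/port reads may share a cycle
    return initiation_interval, c
```

Two reads of the same signal with an initiation interval of 1 give n = 1 and c = 0, while any pairing that involves a write or a memory access gives c = 1.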

Figure 15 shows an example of Verilog source code for a loop 1530 with an I/O 
dependency between read 1510 and read 1520. Both read 1510 and read 1520 read the 
value of the same signal, x. Thus, read 1520 must be scheduled such that it occurs 
before, or simultaneously with, read 1510 in the next iteration of the loop. In this 
example, a human designer has specified that this loop 1530 will be scheduled using an 
initiation interval of 1 and a latency of 3. 

Figure 16 shows the GTech circuit 1500 which is created for loop 1530 of 
Figure 15. Circuit 1500 is output from step 810 of Figure 6. Read 1510 is implemented 
by read operation 3130. Read 1520 is implemented by read operation 3140. In this 
example, a human designer has specified that this loop will be pipelined with an 
initiation interval of 1 and a latency of 3. 

Figure 17 shows a constraint between read node 1610, the first read of x in loop 
1530, and read node 1620, the last read of x in loop 1530. Read node 1610 and read 
node 1620 were identified in step 1130 of Figure 8. This constraint is created using the 
method of Figure 10. First, step 610 adds placeholder node H 1760 to the template 1750 
of read node 1610. Placeholder node H is constrained to be 1 cycle after read node 
1610, because the initiation interval is 1, by template 1650. Next, step 620 creates 
constraint 1770, represented by an arrow, which constrains read node 1620 to be at least 
0 cycles before placeholder node H 1760, that is, in the same cycle as placeholder node 
H 1760 or earlier. Read node 1620 is constrained to be 0 cycles before placeholder node 
H 1760 because read node 1620 and read node 1610 are both signal reads, and as such 
are allowed to occur in the same control step. Constraint 1770 is then used to schedule 
the loop correctly using a method such as the one shown in Figure 9. 

Figure 28 shows read operations on signal x of Figure 16 scheduled into control 
steps by step 830 of Figure 6. For the sake of clarity, the other operations in the circuit 
are not shown. Two iterations of the loop are shown, to demonstrate how the schedule 
properly handles the multiple signal reads. Read 3130 is scheduled so that it occurs 
simultaneously with read 3140 in the next iteration of the loop. Since simultaneous 
signal reads are allowed, this is a legal schedule. 

Figure 18 shows the circuit created from the Verilog HDL source code of Figure 
11 after scheduling. 

5.0 Circuit Synthesis using Delayed Signal Assignment Information 

Conventional design methodology uses a simulator to verify the correctness of a 
design both before and after it is synthesized. Conventional simulation systems, 
especially those systems performing behavioral synthesis, do not always yield identical 
cycle timing characteristics when HDL source code is simulated and when a synthesis 
output (a representation of a synthesized circuit) is simulated. It is advantageous for 
behavioral synthesis to be able to infer a circuit which will have the same cycle by cycle 
behavior during simulation as the simulation of the source HDL. 

The source code of Figure 19(a) is written in the Verilog circuit specification 
language. The source code of Figure 19(b) is written in the VHDL circuit specification 
language. Both Verilog and VHDL are Hardware Description Languages (HDLs). 

In Figure 19(a), the Verilog source code includes a signal assignment statement: 

c <= #24 x - p; 

This statement includes a delay clause ("#24") indicating that a delay of 
twenty-four time units, e.g., nanoseconds, should pass before the write operation is performed 
by the circuit that is to be generated. The delay clause is an example of delayed signal 
assignment information. Note that the inclusion of the delay clause in the HDL indicates 
a delay of the write operation only. The delay clause does not cause a delay in the 
performance of the subtraction operation. Similarly, in Figure 19(b), the VHDL source 
code includes a signal assignment statement: 

c <= transport x - p after 24 ns; 

This statement also contains a delay clause ("after 24 ns") indicating that a delay 
of twenty-four time units should occur in the generated circuit before the write 
operation is performed. This delay clause is a further example of delayed signal 
assignment information. 

A circuit loop generated from the HDL source code of Figure 19(a) and Figure 
19(b) will have an initiation interval of "2" because each source code example has two 
"wait" (or "posedge" or "negedge") statements within the loop. As discussed below, the 
delay clause in the source code causes the resulting loop to have a loop latency of "4". 
Figure 19(a) and Figure 19(b) are included for the purpose of example only. The present 
invention can use any appropriate type of source code (VHDL, Verilog, etc.) to 
represent a delay clause. 

Figure 20 is a flowchart showing steps performed during translation step 810 of 
Figure 6 to generate a cdb. The exact placement of the steps of Figure 20 is not a part 
of the present invention and the steps also can be performed, for example, in the 
preprocessing step 820 of Figure 6. The input to Figure 20 is a representation of one of 
the source code examples of Figure 19(a) and Figure 19(b), such as a parse tree 
generated from the source code. The steps of Figure 20 are performed for each 
statement in the source code. The output of the translation step 810 and Figure 20 is a 
data flow graph (a "Gtech circuit") and a control flow graph (a "control data base" 
(cdb)). It will be understood by persons of ordinary skill in the art that the steps of 
Figure 20 and Figure 23 are performed by processor 109 of Figure 5, performing 
instructions stored in memory 104 of Figure 5. 

In step 2002, the processor determines whether the current source code 
statement is a signal assignment statement (e.g., an assignment to a port using the "<=" 
operator) that includes a delay clause (e.g., "#24" in Verilog or "after 24 ns" in VHDL). 
If not, in step 2002, the processor performs standard processing for the node to build a 
node in the data flow graph. If the current source code statement includes a delay 
clause, then, in step 2004, the processor builds a write operation node in the data flow 
graph and annotates the node by adding an attribute indicating delayed signal 
assignment information to show that the write operation corresponding to the write 
operation node has a delay of, e.g., 24 nanoseconds (see node 2114 of Figure 21 and 
Figure 22). 
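
The per-statement decision of Figure 20 can be illustrated with a small sketch. It is a stand-in only: the described embodiment operates on a parse tree, whereas this sketch matches statement text with regular expressions, and the function name `build_node` and the dictionary layout of a node are hypothetical.

```python
import re

# Matches "<target> <= #<delay> <expr>;" (Verilog delayed assignment).
_VERILOG = re.compile(r"^\s*(\w+)\s*<=\s*#(\d+)\s*(.+?);\s*$")
# Matches "<target> <= [transport] <expr> after <delay> ns;" (VHDL).
_VHDL = re.compile(
    r"^\s*(\w+)\s*<=\s*(?:transport\s+)?(.+?)\s+after\s+(\d+)\s*ns\s*;\s*$"
)


def build_node(statement):
    """Return a write node annotated with its delay, or a standard node."""
    m = _VERILOG.match(statement)
    if m:
        target, delay, expr = m.group(1), int(m.group(2)), m.group(3)
        return {"kind": "write_op", "port": target, "expr": expr, "delay": delay}
    m = _VHDL.match(statement)
    if m:
        target, expr, delay = m.group(1), m.group(2), int(m.group(3))
        return {"kind": "write_op", "port": target, "expr": expr, "delay": delay}
    # No delay clause: standard processing builds an ordinary node.
    return {"kind": "standard", "text": statement.strip()}
```

Both example statements from Figure 19 yield a write operation node carrying a delay attribute of 24, while a statement without a delay clause yields a standard node.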

Figure 21 shows an example of a data flow graph 2100 generated from one of 
the source code examples of Figure 19(a) and Figure 19(b) in accordance with the steps 
of Figure 20. A representation of data flow graph 2100 is stored in memory 104. Data 
flow graph 2100 includes as inputs a port x, a register p, and ports y and z. Each port 
has zero or more read operation nodes ("read op") 2102, 2104, 2106 associated 
therewith and each read operation node has an attribute indicating a port name (e.g., 
"port=x"). Respective ones of the inputs are input to a subtracter node 2110 and an 
adder node 2112. Subtracter node 2110 is connected to a write operation node 2114. 
Adder node 2112 is connected to a variable assignment node 2116. Output p' is input as 
p during successive iterations of the loop. Thus, the data flow graph of Figure 21 has 
seven nodes representing the data flow in the circuit to be synthesized. 

In step 2008 of Figure 20, if there are more statements in the source code, 
control returns to step 2002. If all statements have been processed and a data flow graph 
(including signal delay attributes) has been generated for the source code, control passes 
to step 2012, where a control flow graph, such as that in Figure 22, is created. 

Control graph 2200 of Figure 22 adds control information to nodes 2102, 2104, 
2106, 2110, 2112, 2114, and 2116 indicating the order and conditions under which the 
data flow nodes are executed in the synthesized circuit. A representation of control 
graph 2200 is stored in memory 104 of Figure 5. The present invention preferably 
operates in a "cycle fixed mode" in which each "wait" (or "posedge" or "negedge") 
statement in the source code indicates a new cycle in the synthesized circuit. Various 
processes for generating control flow graphs are known to persons of ordinary skill in 
the art and are described in High-Level Synthesis by Gajski et al. 

In Figure 22, cnodes are used as "placeholder" nodes in the control graph to 
represent a collection of data flow nodes. Thus, cnode 2200 is associated with write 
operation node 2114 (including the signal delay attribute), read operation node 2102, 
and subtracter node 2110. The wait nodes in Figure 22 are used to represent the 
transitions between each cycle (or "cstep"). A wait node 2204 is used to mark the 
transition between the first cstep (cstep 0) and the second cstep (cstep 1). Wait node 
2204 also has attributes indicating that it is based on a rising clock edge (due to the 
"posedge" statement in the source code). "Wait statements" (in VHDL source code) are 
treated similarly. Cnode 2206 (located in the second cstep) is associated with variable 
assignment node 2116, read operation node 2104, read operation node 2106, and adder 
node 2112. The control graph also includes a second wait node 2208 and a third cnode 
2210. 

As shown in Figure 7, the control flow graph is input to step 920, where a 
control data flow graph (CDFG) is created. The general procedure for creating a 
conventional CDFG is known to persons of ordinary skill in the art and is described in 
High-Level Synthesis by Gajski et al. Figure 23 shows certain details of the process of 
creating a CDFG that relate to the delay clause of the present invention. An example 
CDFG is shown in Figure 24. The steps of Figure 23 are performed for each loop in the 
control flow graph. In step 2302, the processor sets a wait_count variable and a 
Max_wait_count variable in the memory 104 to an initial value of "0". In 
step 2304 the processor builds a "loop begin" node in the CDFG and assigns to it a cstep 
attribute value equal to "0". 

Step 2306 is a first step in a loop performed by the processor for each cdb node. 
In step 2308, if the current cdb node is a cnode, control passes to step 2310, which is a 
first step in a loop performed for all data flow nodes associated with the current cdb 
node. In step 2312, if a current data flow node is a write operation node having a delay 
clause (i.e., if the current data flow node represents a delayed signal assignment), 
control passes to step 2322. 

In step 2322, a temp_wait_count variable is set to the current value 
of wait_count plus the number of delay time units in the delayed signal assignment 
divided by the clock period (e.g., 0+24/6=4). A CDFG node is created and assigned to 
cstep temp_wait_count in step 2324. In step 2326, if temp_wait_count 
is greater than Max_wait_count, then in step 2328, Max_wait_count 
is set equal to temp_wait_count. Otherwise, control passes 
to step 2342. If, in step 2342, there are more data flow nodes associated with the current 
cdb node, then control passes to step 2310. Otherwise control passes to step 2336. 

If, in step 2312, the current data flow node is not a delayed signal assignment, the 
processor builds a standard CDFG node in step 2314 and assigns the created 
node to cstep wait_count in step 2316. If, in step 2318, wait_count is greater 
than Max_wait_count, then Max_wait_count is set equal to 
wait_count in step 2320. Control next passes to step 2342. 

If, in step 2308, the current cdb node is not a cnode, then control passes to step 
2330. If in step 2330 the current cdb node is a wait node, then wait_count is 
incremented in step 2332 and control passes to step 2336. If, in step 2330, the current 
cdb node is not a wait node, then regular processing is performed to create a CDFG 
node in step 2334 and control passes to step 2336. 

In step 2336, if there are more cdb nodes to process, then control passes to step 
2306. Otherwise, a loop_latency variable in memory 104 for the loop is set equal 
to Max_wait_count and an initiation interval variable for the loop is 
set equal to wait_count in step 2338. In step 2340, the processor builds a "loop 
end" node in the CDFG and assigns it to cstep wait_count. 
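
The flow of Figure 23 can be sketched as a single pass over the cdb nodes. The classes `DataFlowNode`, `CNode` and `Wait`, the function `build_cdfg`, and the use of integer division for "divided by the clock period" are assumptions made for this illustration; the clock period of 6 is taken from the 0+24/6=4 example above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataFlowNode:
    name: str
    delay: Optional[int] = None  # delay time units for a delayed write, else None


@dataclass
class CNode:
    ops: list = field(default_factory=list)


class Wait:
    pass


def build_cdfg(cdb_nodes, clock_period):
    """Return (csteps, loop_latency, initiation_interval) for one loop."""
    wait_count = 0                   # step 2302
    max_wait_count = 0
    csteps = {"loop_begin": 0}       # step 2304
    for node in cdb_nodes:           # step 2306
        if isinstance(node, CNode):  # step 2308
            for op in node.ops:      # step 2310
                if op.delay is not None:  # step 2312: delayed signal assignment
                    cstep = wait_count + op.delay // clock_period  # step 2322
                else:
                    cstep = wait_count                             # step 2316
                csteps[op.name] = cstep
                max_wait_count = max(max_wait_count, cstep)
        elif isinstance(node, Wait):  # step 2330
            wait_count += 1           # step 2332
    csteps["loop_end"] = wait_count   # step 2340
    return csteps, max_wait_count, wait_count  # step 2338


# Control graph of Figure 22: cnode, wait, cnode, wait, cnode.
cdb = [
    CNode([DataFlowNode("write_2114", delay=24),
           DataFlowNode("read_2102"), DataFlowNode("sub_2110")]),
    Wait(),
    CNode([DataFlowNode("assign_2116"), DataFlowNode("read_2104"),
           DataFlowNode("read_2106"), DataFlowNode("add_2112")]),
    Wait(),
    CNode(),
]
csteps, loop_latency, initiation_interval = build_cdfg(cdb, clock_period=6)
```

Under these assumptions the delayed write lands in cstep 4, giving a loop latency of 4 and an initiation interval of 2, which agrees with the values stated for the source code of Figure 19.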

The output of step 920 of Figure 7 is input to the scheduler, which uses the 
CDFG and the loop initiation interval and loop latency to schedule the nodes of the 
circuit being generated. In the described embodiment, all nodes except read/write 
operation nodes can "float", i.e., can be moved between csteps by the scheduler to allow 
the scheduler to create an efficient circuit design. In the CDFG, these nodes are always 
assigned a cstep value equal to the initial cstep in which they appear in the HDL as a 
"suggestion" to the scheduler. It will be understood by persons of ordinary skill in the 
art that the CDFG of Figure 24 has been simplified for the sake of example and that the 
CDFG also includes, e.g., data flow arcs connecting the CDFG nodes that represent data 
flows in a similar manner to the data flows of Figure 21. 

Figure 14 shows an example circuit synthesized from the CDFG of Figure 24. 
Figure 25 shows an example of placement of CDFG nodes in csteps without and with 
use of the delay clause. In the left column, which represents the CDFG without the delay 
clause, CDFG nodes corresponding to write operation node 2114, read operation node 
2102, and subtracter node 2110 are assigned to cstep 0. Similarly, CDFG nodes 
corresponding to adder node 2112, read operation node 2104, read operation node 2106, 
assignment node 2116 (and a CDFG loop_end node) are assigned to a second 
cstep 1. Generation of this CDFG representation causes the synthesizer to generate a 
circuit that has different timing characteristics than the characteristics generated by the 
circuit synthesizer when the source code includes a delay clause. The right column of 
Figure 25 shows the assignment of CDFG nodes to cycles in accordance with the 
present invention. In this example, a write operation node corresponding to write 
operation node 2114 is moved into cstep 4 during the steps of Figure 23. This 
modification of the process to generate the CDFG (possible because of an addition of a 
signal delay attribute to the data graph 2100) allows the synthesis process to generate a 
circuit that has cycle level simulation behavior that is substantially identical to 
the cycle level simulation behavior of the source HDL. 

Figure 26 shows an example of loop pipelining when the present invention is 
used. The figure shows an nth iteration of the loop and an n+1st iteration of the loop 
over time. As can be seen in the figure, the initiation interval of successive iterations of the 
loop is equal to the number of wait statements (or "posedge" or "negedge" statements). 
The loop latency is equal to the longest cycle delay from the beginning of the loop to a 
latest operation. The throughput of the pipelined loop is not decreased by use of delayed 
signal assignments. In general, the scheduler will schedule a circuit having the CDFG of 
Figure 24 as a pipelined circuit because the loop latency is longer than the initiation 
interval. 
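
The overlap of successive iterations can be illustrated with a short calculation. The function names `iteration_windows` and `is_pipelined` are hypothetical, and control steps are assumed to be counted from the start of the first iteration shown.

```python
def iteration_windows(initiation_interval, loop_latency, iterations):
    """Return (start_cstep, last_cstep) for each successive loop iteration."""
    return [(i * initiation_interval, i * initiation_interval + loop_latency)
            for i in range(iterations)]


def is_pipelined(initiation_interval, loop_latency):
    """Iterations overlap when the latency exceeds the initiation interval."""
    return loop_latency > initiation_interval


windows = iteration_windows(initiation_interval=2, loop_latency=4, iterations=2)
```

With an initiation interval of 2 and a loop latency of 4, the second iteration starts in cstep 2 while the first is still completing its delayed write in cstep 4, so the two iterations overlap.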

In summary, use of delayed signal assignments allows behavioral synthesis to 
infer circuits with pipelined loops which have cycle level simulation behavior which 
matches that of the source HDL. Pipelined loops may include loop carry dependencies 
and/or I/O and/or memory accesses which must be scheduled correctly. The use of a 
placeholder node within a template is an efficient representation of such scheduling 
constraints. 

WHAT IS CLAIMED IS: 

1. A method performed by a data processing system having a memory, comprising the 
steps of: 

parsing a text description of a circuit, said text description stored in the memory, 
said text description including a loop with a delayed signal assignment having a delay 
value; 

translating said text description into a digital circuit representation in said 
memory, said digital circuit representation including a pipeline; and 

setting a latency of said pipeline equal to said delay value. 

2. The method of claim 1, wherein said loop further includes N wait statements, where 
N is greater than zero, said method further comprising the step of setting an initiation 
interval of said pipeline equal to N. 

3. The method of claim 1, wherein said text description is written in Verilog and said 
delayed signal assignment uses a Verilog "#" operator. 

4. The method of claim 3, wherein said wait statements use Verilog "@posedge" 
statements. 

5. The method of claim 3, wherein said wait statements use Verilog "@negedge" 
statements. 

6. The method of claim 1, wherein said text description is written in VHDL, said 
delayed signal assignment uses a VHDL "after" clause, and said wait statements use 
VHDL "wait" statements. 



7. A method, performed by a data processing system having a memory, of building a 
digital circuit representation including a pipeline in the memory from a textual 
description of a loop, comprising the steps of: 

identifying a loop carry dependency in said loop; 

identifying a producer operation of said loop carry dependency; 

identifying a consumer operation of said loop carry dependency; 

determining a number, n, of cycles within which said producer operation must 
be scheduled after said consumer operation; 

instantiating a placeholder node in said memory; 

node-locking said placeholder node so that it must be scheduled n cycles after 
said consumer operation; and 

constraining said producer operation to be scheduled before said placeholder 
node. 

8. The method of claim 7, wherein the step of node-locking said placeholder node 
further comprises the step of creating a template structure in said memory which 
includes said placeholder node and said consumer operation. 

9. The method of claim 4, 

wherein said producer operation is included in a second template structure in 
said memory, and 

wherein the step of constraining said producer operation further comprises the 
step of constraining said second template structure to be scheduled before said template 
structure. 

10. The method of claim 7, wherein n is equal to an initiation interval of said pipeline 
multiplied by a number of iterations of said loop which execute before data produced by 
said producer is consumed by said consumer. 

11. A method, performed by a data processing system having a memory, of building 
a digital circuit representation in said memory, said digital circuit representation 
including a pipeline derived from a textual description of a loop, said method 
comprising the steps of: 

identifying an access dependency of said loop; 

identifying a first access operation of said access dependency; 

identifying a second access operation of said access dependency; 

determining a number, n, of cycles within which said second access operation 
must be scheduled after said first access operation; 

instantiating a placeholder node in said memory; 

node-locking said placeholder node so that it must be scheduled n cycles after 
said first access operation; and 

constraining a scheduling order of said second access operation and said 
placeholder node. 

12. The method of claim 11, 

wherein said first access operation is chosen from the group of access operations 
including a memory read, a memory write, a signal write and a port write, 

said second access operation is chosen from the group of access operations 
including a memory read, a memory write, a signal read, a signal write, a port read and 
a port write, and 

the step of constraining said scheduling order of said second access operation 
and said placeholder node further includes the step of forcing said second access 
operation to be scheduled before said placeholder node. 

13. The method of claim 11, 

wherein said first access operation is chosen from the group of access operations 
including a memory read, a memory write, a signal read, a signal write, a port read and 
a port write, 

said second access operation is chosen from the group of access operations 
including a memory read, a memory write, a signal write and a port write, and 

the step of constraining said scheduling order of said second access operation 
and said placeholder node further includes the step of forcing said second access 
operation to be scheduled before said placeholder node. 

14. The method of claim 11, 

wherein said first access operation is chosen from the group of access operations 
including a signal read and a port read, 

said second access operation is chosen from the group of access operations 
including a signal read and a port read, and 

the step of constraining said scheduling order of said second access operation 
and said placeholder node further includes the step of forcing said second access 
operation to be scheduled simultaneous with, or before, said placeholder node. 

15. The method of claim 11, wherein the step of constraining said scheduling order of 
said second access operation and said placeholder node further includes the step of 
forcing said second access operation to be scheduled before said placeholder node. 

16. The method of claim 11, wherein the step of node-locking said placeholder node 
further includes the step of creating a template which includes said placeholder node 
and said first access operation. 

17. The method of claim 11, wherein n is equal to an initiation interval of said pipeline 
multiplied by a number of iterations of said loop which execute between said first 
access operation and said second access operation. 

18. A system for building, in a memory, a digital circuit representation which 
implements the behavior of a text description in said memory, said system having a 
processor coupled to a memory unit wherein said processor is programmed to perform 
logic processing, said system comprising: 

parsing logic for parsing said text description into a parsed text description, said 
text description including a loop with a delayed signal assignment having a delay value; 

translating logic for translating said parsed text description into said digital 
circuit representation, said digital circuit including a pipeline; and 

latency setting logic for setting a latency value of said pipeline to be said delay 
value of said delayed signal assignment. 

19. A system as described in claim 18, wherein said pipeline implements said loop. 

20. A system as described in claim 19, wherein said loop further includes a number, n, 
of wait statements, said system further comprising initiation interval setting logic for 
setting an initiation interval of said pipeline to be equal to n. 

21. A computer program product comprising: 

a computer usable medium having computer readable code embodied therein for 
building a digital circuit representation from a text description of a digital circuit, the 
computer program product comprising: 

computer readable program code devices configured to cause a computer to 
effect parsing said text description, said text description including a loop with a delayed 
signal assignment having a delay value; 

computer readable program code devices configured to cause a computer to 
effect translating said text description into said digital circuit representation including a 
pipeline; and 

computer readable program code devices configured to cause a computer to 
effect setting a latency of said pipeline equal to said delay value. 

22. The computer program product of claim 21 wherein said loop further includes N 
wait statements, where N is greater than zero, said computer program product further 
comprising computer readable program code devices configured to cause a computer to 
effect setting an initiation interval of said pipeline equal to N. 

23. A method performed by a data processing system having a memory, 
comprising the steps of: 

parsing a text description of a circuit, said text description stored in the memory, 
said text description including a loop with N wait statements, where N is greater than 
zero; 

translating said text description into a digital circuit representation in said 
memory, said digital circuit representation including a pipeline; and 

setting an initiation interval of said pipeline equal to N. 

24. The method of claim 23, wherein the wait statements are VHDL wait 
statements. 

25. The method of claim 23, wherein the wait statements are Verilog HDL 
@posedge statements. 

26. The method of claim 23, wherein the wait statements are Verilog HDL 
@negedge statements. 

27. A system for building, in a memory, a digital circuit representation
which implements the behavior of a text description in said memory, said system having
a processor coupled to a memory unit wherein said processor is programmed to perform
logic processing, said system comprising:

parsing logic for parsing said text description into a parsed text description, said
text description including a loop with N wait statements, where N is greater than zero;

translating logic for translating said parsed text description into said digital
circuit representation, said digital circuit including a pipeline; and

initiation interval setting logic for setting an initiation interval of said pipeline
equal to N.

28. The system of claim 27, wherein the wait statements are VHDL wait
statements.

29. The system of claim 27, wherein the wait statements are Verilog HDL
@posedge statements.

30. The system of claim 27, wherein the wait statements are Verilog HDL
@negedge statements.

31. A computer program product comprising a computer usable medium
having computer readable code embodied therein for building a digital circuit
representation from a text description of a digital circuit, the computer program product
comprising:

computer readable program code devices configured to cause a computer to
effect parsing said text description, said text description including a loop with N wait
statements, where N is greater than zero;

computer readable program code devices configured to cause a computer to
effect translating said text description into said digital circuit representation including a
pipeline; and

computer readable program code devices configured to cause a computer to
effect setting an initiation interval of said pipeline equal to N.

32. The method of claim 31, wherein the wait statements are VHDL wait
statements.

33. The method of claim 31, wherein the wait statements are Verilog HDL
@posedge statements.

34. The method of claim 31, wherein the wait statements are Verilog HDL
@negedge statements.

Abstract

METHODS FOR AUTOMATICALLY PIPELINING LOOPS
A method and an apparatus for creating a representation of a circuit with a
pipelined loop from an HDL source code description. It infers a circuit including a
pipelined loop which has cycle level simulation behavior matching that of the source
HDL. Loop carry dependencies and memory and signal I/O accesses within the loop are
scheduled correctly.

module write4 ( w, x, clock );

input [15:0] x;
input clock;
output [31:0] w;
reg [32:0] w;
reg [15:0] x1;
reg [15:0] x2;

always begin                      1530
forever begin : writeloop         1530

x1 <= x;

x2 <= x;

                                  1530

w <= x1 * x2;

end

end

endmodule



Figure 15

module after1 ( c, x, y, clock );

input [1:0] x, y, z;
input clock;
output [2:0] c;
reg [2:0] c;
reg [2:0] p;

always begin

@(posedge clock);

forever begin

c <= #24 x - p;

@(posedge clock);
@(posedge clock);

end

end
endmodule



Figure 19 (a) 

entity after1 is
port(

c : out integer range 0 to 7;
x, y, z : in integer range 0 to 3;
clock : in bit

);

end after1;

architecture behavioral of after1 is begin
process

variable p : integer range 0 to 7;

begin

wait until clock'event and clock = '1';
loop

c <= transport x - p after 24 ns;

wait until clock'event and clock = '1';

p := y + z;

wait until clock'event and clock = '1';

end loop;

end process;
end behavioral;



Figure 19 (b)

Scheduling using Behavioral Templates

Tai Ly, David Knapp, Ron Miller, Don MacMillen
Synopsys Inc.
700B E. Middlefield Road
Mountain View, CA USA 94043



Abstract: This paper presents the idea of "behavioral templates"
in scheduling. A behavioral template locks several operations
into a relative schedule with respect to one another. This
simple construct proves powerful in addressing: (1) timing
constraints, (2) sequential operation modeling, (3) pre-chaining of
certain operations, and (4) hierarchical scheduling. We present
design examples from industry to demonstrate the importance
of these issues in scheduling.

1.0 Introduction 

The task of scheduling [4] is to sequence nodes in a control and data 
flow graph (CDFG) by assigning each node to a control step 
(cstep). We present the idea of behavioral templates, and describe 
how we use behavioral templates to address several issues that arise 
when applying scheduling to commercial designs. For the purpose
of this paper, we assume timing constrained scheduling [5].

A behavioral template specifies a relative scheduling among its 
member CDFG nodes. It is a template in the sense that its member 
nodes can be treated as a single scheduling unit by assigning the 
starting cstep for the template. It is behavioral in the sense that it
specifies a scheduling pattern as opposed to, for example, a structural
pattern [8]. We extend scheduling algorithms to handle behavioral
templates by recasting the task of scheduling as that of
assigning templates to csteps.

Although a simple idea, behavioral templates provide a powerful 
way to address four issues in scheduling: 

1. Timing constraints. We use behavioral templates to impose 
fixed and maximum timing constraints. This is more efficient 
than using precedence edges alone because an entire sequence of 
nodes is considered at once when scheduling one template. 

2. Multi-cycle operations. To enable scheduling of complex multi-cycle
operations, we use multiple CDFG nodes locked in a
behavioral template to model the cycle-by-cycle I/O and resource
requirements of such operations.

3. Logic and bit-manipulation operations. We use behavioral
templates to force certain chaining of logic and bit-manipulation
operations to save register costs. This reduces the scheduling
design space, and therefore run times.



4. CDFG hierarchy. We implement hierarchical scheduling by 
inlining each scheduled subgraph, using a behavioral template to 
lock the inlined nodes according to the subgraph's schedule. 

This paper is organized as follows. Section 2 compares this work to
previous research. Section 3 defines behavioral templates. Section 4
describes extending scheduling for behavioral templates. Section 5
discusses applications. Section 6 presents results. Section 7
concludes this paper.

2.0 Related Work 

The term "template" was used in [8] to describe structural patterns
to exploit regularity. In [9] and [10], such templates are used to
guide the clustering of CDFG nodes into super nodes which map to
"regular" subcircuits. Both of these works focus on extracting regular
patterns by pattern matching, whereas our work focuses on how
to schedule a set of behavioral patterns. Our behavioral templates
do not represent repeating patterns, but specify local scheduling
constraints among CDFG nodes.

Most scheduling systems model multi-cycle operations using single
CDFG nodes whose delays are greater than 1. In [7], multi-cycle
operations are treated as multiple single-cycle operations. This
turns out to be similar to our template-based model for sequential
operations, except that we make deliberate use of cycle-by-cycle
input/output and resource requirements to model complex
operations.

Hierarchical scheduling based on super nodes is used in [9], [6],
and [7]. We know of no other system which hierarchically schedules
a design while taking advantage of possible resource sharing
between nodes and edges in different subgraphs.

3.0 Behavioral Templates 

We define a behavioral template, T, as a CDFG object which specifies
a set of tuples, (n_i, o_i), where n_i is a CDFG node and o_i is an
integer cycle offset. The semantics is that T imposes the constraint:

schedule(n_i) = schedule(T) + o_i    for all (n_i, o_i) in T

where schedule(n_i) and schedule(T) denote the schedules for n_i
and T, respectively.

That is, if T is scheduled to cstep j, then every member node, n_i, of
T must be scheduled to the cstep j + o_i. This locks all nodes in T
into a pattern of relative schedules, and we may schedule the entire
group of nodes by scheduling the template T itself. Fig. 1(a) shows
a template, T1 = { (a,0) (b,1) (c,2) (d,3) (e,5) }, containing 5 nodes.
All CDFG edges have been omitted for clarity. (In the figures, we
show a behavioral template as a box containing one or more nodes in
slots. The top slot in the box is offset 0, the second slot from top is
offset 1, and so on. For example, the node "e" in Fig. 1(a) has offset
5 in T1 because it is in the 6th slot from the top of the box.)

Whenever a node is a member of two or more different templates,
we can always merge these templates into one. Consider the templates
T1 and T2 in Fig. 1(a) and 1(b). If node g is to be added to
template T1 at offset 1, then we merge T1 and T2 into T3 of Fig.
1(c).
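The merge described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a template is represented as a dict from node name to offset, the contents of T2 are hypothetical (the figure is not legible in this copy), and the helper name merge_into is invented for the example. Offsets are renormalized so that the smallest offset is 0, which is also an assumption.

```python
def merge_into(t1, t2, node, offset_in_t1):
    """Merge template t2 into t1 so that `node` (a member of t2)
    lands at `offset_in_t1` in the merged template."""
    shift = offset_in_t1 - t2[node]          # move t2 into t1's frame
    merged = dict(t1)
    for n, o in t2.items():
        new_o = o + shift
        if n in merged and merged[n] != new_o:
            raise ValueError(f"conflicting offsets for node {n!r}")
        merged[n] = new_o
    base = min(merged.values())              # keep the top slot at offset 0
    return {n: o - base for n, o in merged.items()}

T1 = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 5}
T2 = {"f": 0, "g": 1}                        # hypothetical contents of T2
T3 = merge_into(T1, T2, "g", 1)
```

After the merge every node belongs to exactly one template, which is the invariant the scheduling formulation relies on.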



FIGURE 1. CDFG behavioral template: (a) T1, (b) T2, (c) T3

4.0 Scheduling with Behavioral Templates

Instead of scheduling individual CDFG nodes, we restate the scheduling
problem in terms of behavioral templates. Initially we create
one template for every CDFG node, and then merge templates
whenever nodes are added to other templates. This ensures that
every CDFG node is a member of one and only one template. The
timing constrained scheduling task is then to schedule all templates
to minimize resource costs subject to timing constraints between
templates. This section describes how we extend existing scheduling
algorithms for behavioral templates.

4.1 Timing Constraints between Templates

From the CDFG, we construct a weighted, directed graph G=(V,E)
where V is the set of all behavioral templates in the CDFG, and E is
the set of directed edges between templates. The weight d(T_x, T_y) of
an edge e(T_x, T_y) in E specifies the minimum delay between the
schedules of T_x and T_y, i.e.,

schedule(T_x) + d(T_x, T_y) <= schedule(T_y)    Eq. 1

The edges in E are constructed from the data/control dependencies
between member nodes in the templates. For every pair of templates,
T_x and T_y, d(T_x, T_y) is the maximum value of

w(n_i, n_j) + o_i - o_j    Eq. 2

over all (n_i, o_i) in T_x and all (n_j, o_j) in T_y, where w(n_i, n_j) is the
minimum cycle delay from node n_i to node n_j.
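As an illustration of Eq. 2, the following Python sketch computes the edge weight between two templates from the node-level delays. The dict-based representation, the function name template_edge_weight, and the sample numbers are assumptions made for the example, not taken from the paper.

```python
def template_edge_weight(tx, ty, w):
    """Return d(Tx, Ty) as in Eq. 2, or None when no member of Tx
    constrains a member of Ty.

    tx, ty: dict node -> offset; w: dict (n_i, n_j) -> minimum cycle delay.
    """
    candidates = [
        w[(ni, nj)] + oi - oj
        for ni, oi in tx.items()
        for nj, oj in ty.items()
        if (ni, nj) in w
    ]
    return max(candidates) if candidates else None

Tx = {"a": 0, "b": 2}
Ty = {"c": 0, "d": 3}
w = {("a", "c"): 1, ("b", "d"): 1}
d_xy = template_edge_weight(Tx, Ty, w)   # max(1 + 0 - 0, 1 + 2 - 3)
```

A dependency into a node that sits late in its template can yield a negative weight, for example a delay of 1 into a node at offset 3 gives 1 + 0 - 3 = -2.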

Note that Eq. 2 can be negative. This means d(T_x, T_y) can be negative,
and the graph G is not acyclic. If G contains any cycle of positive
weight, the timing constraints are unsatisfiable. To check
for positive cycles, we solve for the all-pairs-longest-path
for G using a simple O(N^3) algorithm, where N is the number of
templates in G. The longest path lengths are stored in a matrix, LP,
for subsequent incremental update of the as soon as possible
(ASAP) and as late as possible (ALAP) schedules.
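One simple cubic-time way to obtain all-pairs longest paths and detect positive cycles is a Floyd-Warshall style sweep with max in place of min. The sketch below assumes that approach; the paper does not say which algorithm it uses. Templates are identified by integer index, and the function name and edge format are illustrative.

```python
NEG_INF = float("-inf")

def longest_paths(n, edges):
    """All-pairs longest path lengths for n templates.

    edges: dict (u, v) -> weight d(T_u, T_v), possibly negative.
    Returns the matrix LP, or None if a positive cycle exists
    (i.e. the timing constraints cannot be satisfied).
    """
    lp = [[NEG_INF] * n for _ in range(n)]
    for i in range(n):
        lp[i][i] = 0
    for (u, v), d in edges.items():
        lp[u][v] = max(lp[u][v], d)
    for k in range(n):
        for i in range(n):
            if lp[i][k] == NEG_INF:
                continue
            for j in range(n):
                if lp[k][j] != NEG_INF:
                    lp[i][j] = max(lp[i][j], lp[i][k] + lp[k][j])
    if any(lp[i][i] > 0 for i in range(n)):
        return None          # some template would have to precede itself
    return lp

ok = longest_paths(3, {(0, 1): 2, (1, 2): 1, (2, 0): -3})
bad = longest_paths(2, {(0, 1): 2, (1, 0): -1})
```

In the first call the cycle has total weight 0, so the constraints are tight but satisfiable; in the second the cycle weight is +1 and the function reports failure.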

4.2 ASAP and ALAP Schedules 

At the start of scheduling, we calculate the ASAP and ALAP schedules
for all templates in G, to establish the scheduling time frame
for each template. Since G may contain negative weighted edges,
we use a relaxation algorithm similar to that in [3] to compute the
initial ASAP/ALAP schedules:

(1) Propagate along positive edges in E only;

• for ASAP, propagate forward from the source of CDFG;

• for ALAP, propagate backward from the sink of CDFG;

(2) Relax schedules to satisfy constraints implied by negative
edges in E;

(3) Repeat step 1 until no more changes in relaxation step.

When there are no positive cycles in G, the above algorithm is guaranteed
to converge in e+1 iterations where e is the number of negative
edges in E. The overall computational complexity is O(Ne)
where N is number of templates in G.
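The effect of the relaxation can be illustrated with a simplified fixed-point iteration in Python. Note the simplification: this sketch relaxes all edges uniformly in each pass instead of separating the positive-edge propagation from the negative-edge relaxation, so it shows the result the steps above converge to rather than their exact iteration structure. The function name, the integer template indices, and the last_cstep parameter are assumptions.

```python
def time_frames(n, edges, last_cstep):
    """Compute ASAP and ALAP csteps for n templates.

    edges: dict (u, v) -> d(T_u, T_v); every edge demands
    schedule(u) + d <= schedule(v). Weights may be negative.
    """
    asap = [0] * n
    alap = [last_cstep] * n
    for _ in range(n):
        changed = False
        for (u, v), d in edges.items():
            if asap[u] + d > asap[v]:        # push v later
                asap[v] = asap[u] + d
                changed = True
            if alap[v] - d < alap[u]:        # pull u earlier
                alap[u] = alap[v] - d
                changed = True
        if not changed:
            return asap, alap
    raise ValueError("positive cycle: timing constraints are unsatisfiable")

# The edge (2, 1) with weight -2 says template 1 may start
# no earlier than 2 csteps before template 2.
asap, alap = time_frames(3, {(0, 1): 1, (0, 2): 4, (2, 1): -2}, last_cstep=6)
```

The interval from asap[t] to alap[t] is the initial time frame of template t.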

The ASAP and ALAP schedules define the initial time frames.
Subsequently, as each template is scheduled, we update the time frames
of other templates using the longest path lengths matrix, LP:

schedule(T_x) + LP(T_x, T_y) <= schedule(T_y)    for all T_x, T_y in V

There is no need for relaxation in this incremental update because
LP already takes into account all negative edges in E.
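A minimal Python sketch of this incremental update follows. It assumes LP is a square list-of-lists indexed by template number, with negative infinity marking pairs that have no path; the function name update_frames and the sample values are illustrative.

```python
NEG_INF = float("-inf")

def update_frames(x, cstep, lp, asap, alap):
    """Tighten the time frames of all other templates after
    template x has been scheduled to `cstep` (updates in place)."""
    for y in range(len(lp)):
        if y == x:
            continue
        if lp[x][y] != NEG_INF:              # y must follow x by LP(x, y)
            asap[y] = max(asap[y], cstep + lp[x][y])
        if lp[y][x] != NEG_INF:              # y must precede x by LP(y, x)
            alap[y] = min(alap[y], cstep - lp[y][x])
    asap[x] = alap[x] = cstep

lp = [
    [0, 2, 4],
    [NEG_INF, 0, NEG_INF],
    [NEG_INF, -2, 0],
]
asap = [0, 2, 4]
alap = [2, 6, 6]
update_frames(2, 5, lp, asap, alap)          # schedule template 2 to cstep 5
```

A single pass suffices because each LP entry already reflects the strongest constraint between the two templates.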

4.3 Cost Functions

We use a number of iterative/constructive scheduling algorithms,
each of which successively picks an unscheduled template and
schedules it to a cstep in its time frame. The algorithms differ in
how they pick the next template to schedule, and in how they pick
which cstep to schedule the template to. We define the template
priority/cost functions in terms of priority/cost functions on the CDFG
nodes.

For example, in our implementation of list scheduling, the template
priority function is defined as the maximum of its member nodes'
priority values. This gives priority to the template containing the
highest priority nodes. In our implementation of greedy scheduling,
the incremental cost function for scheduling a template T = {(n_i, o_i)}
to a cstep j, is defined as the sum total of the incremental costs for
scheduling nodes n_i to csteps j + o_i.
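The two template-level functions can be written directly in terms of node-level ones. In the Python sketch below the node priority table and the node cost function are placeholders invented for the example; only the max and sum structure comes from the text.

```python
def template_priority(template, node_priority):
    """List scheduling: a template is as urgent as its most urgent node."""
    return max(node_priority[n] for n in template)

def template_incremental_cost(template, j, node_cost):
    """Greedy scheduling: cost of placing the template at cstep j is the
    sum of the costs of placing each node n_i at cstep j + o_i."""
    return sum(node_cost(n, j + o) for n, o in template.items())

T = {"a": 0, "b": 2}
priority = {"a": 3, "b": 7}                  # hypothetical node priorities
busy = {1: 2, 3: 1}                          # hypothetical per-cstep load

def node_cost(node, cstep):
    # Placeholder cost: how crowded the target cstep already is.
    return busy.get(cstep, 0)

p = template_priority(T, priority)
c = template_incremental_cost(T, 1, node_cost)
```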

Scheduling/de-scheduling moves on templates are implemented as
moves on their member nodes. All data structures are updated as
CDFG nodes are scheduled/de-scheduled. In particular, resource
costs for functional units, registers and interconnects are still
computed according to the lifetimes and mutual exclusivity of CDFG
nodes and edges. This approach is easy to implement and leverages
previous work on scheduling CDFG nodes.

4.4 Pre-assigned Operations

Allowing negative edges in G requires that we extend scheduling
algorithms to handle maximum timing constraints. This is complicated
by "pre-assigned" operations, i.e., operations that are
assigned to specific resources before scheduling. Examples of
pre-assigned operations are memory read/write operations for the same
RAM. We use a list scheduling algorithm to find an initial legal
schedule based on source code ordering. However, list scheduling
can fail to find a legal schedule when there are maximum timing
constraints. So we augment list scheduling with a recovery step.
When list scheduling fails, the recovery step relaxes the template
schedules that caused scheduling failures, and iterates:

1. List scheduling step:

Successively consider operations in the ready list in increasing
source code ordering. For each ready operation, n_i, check its template,
T_x = {...(n_i, o_i)...}, for scheduling in the cstep s - o_i, where s is
the current cstep. Postpone scheduling of T_x if any of the following
is true:

• T_x has a "relaxed cstep" (see step 2) which is greater than s - o_i

• T_x has no "relaxed cstep", but ASAP is greater than s - o_i

• there is a resource contention if T_x is scheduled to s - o_i

When all nodes have been scheduled, exit with success.

If T_x is postponed due to resource contention, and if s - o_i is greater
than or equal to the ALAP cstep for T_x, then list scheduling has
failed. When this happens, go to step 2 and try to recover.

2. Recovery step:

When list scheduling fails to find a legal schedule for T_x, we try to
recover by increasing its ALAP cstep and rerun list scheduling in
step 1. In order to increase the ALAP cstep for T_x, we find all
scheduled templates, T_y, for which

alap(T_x) = schedule(T_y) - LP(T_x, T_y)    Eq. 3

where alap(T_x) is the ALAP cstep for T_x. For every such template,
we set its "relaxed cstep" to schedule(T_y) + 1. This forces the
next run of list scheduling to schedule T_y one cstep later.

This step exits with failure if any of the following is true:

• ALAP(T_x) is at maximum global cstep

• there is no template T_y which satisfies Eq. 3

• algorithm has iterated for N times (N is the # of templates)
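The core of the recovery step, selecting the scheduled templates that pin down the ALAP cstep of the failed template (Eq. 3) and relaxing them by one cstep, might look as follows in Python. This is a partial sketch: the list scheduling loop itself and the first and third exit conditions are omitted, and the data layout (dicts keyed by template name, LP keyed by pairs) is an assumption.

```python
def recovery_step(x, alap, schedule, lp, relaxed):
    """Relax every scheduled template T_y with
    alap(T_x) == schedule(T_y) - LP(T_x, T_y).

    Returns False when no such template exists (recovery fails).
    """
    found = False
    for y, s_y in schedule.items():
        if (x, y) in lp and alap[x] == s_y - lp[(x, y)]:
            relaxed[y] = s_y + 1     # force T_y one cstep later next run
            found = True
    return found

# Values loosely following the example of Fig. 2: T1 sits at cstep 0
# and LP(T2, T1) = 0, so T2's ALAP cstep is 0.
schedule = {"T1": 0}
alap = {"T2": 0}
lp = {("T2", "T1"): 0}
relaxed = {}
recovered = recovery_step("T2", alap, schedule, lp, relaxed)
```

With T1 relaxed to cstep 1, the next list scheduling run is free to place T2 at cstep 0.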




FIGURE 2. Example of list scheduling failure and recovery (LP(T2,T1) = 0)

Fig. 2 shows an example of this algorithm at work. In this example,
CDFG nodes and "c" are pre-assigned to the same resource.
The source code ordering has V before "c" before T'. Initially
list scheduling in step 1 will schedule T1 to cstep 0, and then fails to
schedule T2 because of resource contention at cstep 0, and because
its ALAP cstep is 0 once T1 is scheduled to 0. In step 2, T1 will be
assigned a relaxed cstep of 1. In the next iteration, list scheduling
first schedules T2 to cstep 0, then schedules T1 to cstep 1 to avoid
resource contention, and finally schedules T3 to cstep 3.

We have two recourses when the above algorithm fails: First, we 
can continue to try other scheduling algorithms which may still find 
legal schedules. Second, we can insert precedence constraints to 
sequentialize pre-assigned operations by their source code ordering.
Any unsatisfiable maximum timing constraints would then be
detected as positive cycles in the graph G.

5.0 Applications for Behavioral Templates 

This section highlights how we use behavioral templates to advan- 
tage. Fig. 3 shows the overall flow of our scheduling process. 

( extract CDFG )
       |
( create templates )
       |
( insert constraints )
       |
( inline subgraphs )
       |
( pre-chain nodes )
       |
( schedule templates )

FIGURE 3. Overall flow for hierarchical scheduling

5.1 Inserting Timing Constraints

After extracting the CDFG and creating the initial templates,
user-specified timing constraints are added to the CDFG. Minimum timing
constraints are represented by precedence edges between nodes,
but fixed timing constraints and maximum timing constraints are
represented with the help of behavioral templates. Fixed timing
constraints are when two or more operations must be scheduled a
fixed number of cycles apart. This is represented by adding one
operation to the template of the other operation with the proper offset.
For example, if n_j must start k csteps after n_i starts, and if n_i is
in template T = {...(n_i, o_i)...}, then we add n_j to T at offset o_i + k
(Fig. 4(a)); if n_j must start k csteps after n_i ends, and n_i has a delay
of d cycles, then we add n_j to T at offset o_i + d - 1 + k (Fig. 4(b)).
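The two offset rules for fixed timing constraints reduce to a small arithmetic helper. The Python sketch below uses invented parameter names: ref_offset is the offset of the reference operation already in the template, and the returned value is the offset at which the constrained operation is added.

```python
def fixed_offset(ref_offset, k, after="start", delay=None):
    """Offset for an operation that must start k csteps after the
    reference operation starts (after="start") or ends (after="end").

    For after="end" the reference operation needs a static delay of
    `delay` cycles; it ends in cstep ref_offset + delay - 1.
    """
    if after == "start":
        return ref_offset + k
    if after == "end":
        if delay is None:
            raise ValueError("a static delay is required for after='end'")
        return ref_offset + delay - 1 + k
    raise ValueError("after must be 'start' or 'end'")

template = {"ref": 2}
template["op_a"] = fixed_offset(template["ref"], 3)                    # Fig. 4(a) case
template["op_b"] = fixed_offset(template["ref"], 1, "end", delay=3)    # Fig. 4(b) case
```

When the reference operation has no static delay, no single offset exists, which is why that case is handled with separate minimum and maximum constraints.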

However, if n_j must start k csteps after n_i ends, and n_i does not have
a static delay (e.g., n_i is a subgraph), then we decompose the fixed
constraint into a k-cycle minimum timing constraint from the end of
n_i to the start of n_j, plus a k-cycle maximum timing constraint from
the start of n_j to the end of n_i. This is shown in Fig. 4(c).

Note that in Fig. 4(c), we create a dummy place holder node, ph,
and lock it in a template with n_j (k cycles apart). This template
combines with the precedence edge of weight 0 from ph to the end of n_i
to represent the maximum timing constraint from the start of n_j to
the end of n_i. The precedence edge of weight k from the end of n_i to
start of n_j represents the minimum timing constraint from the end
of n_i to the start of n_j.

In general a maximum timing constraint of k cycles from a set of
nodes, A, to the set of nodes, B, is represented by creating two place
holders, ph1 and ph2, fixed k cycles apart in a template, and inserting
a 0-weight precedence edge from ph1 to all nodes in A, and
inserting a 0-weight precedence edge from all nodes in B to ph2.
Fig. 9(a) contains an example of this where the dummy place holder
nodes, t2 and G, are used to lock a write operation to 0 cycle after
the end of a loop.



FIGURE 4. Timing constraints: (a) n_j starts k cycles after n_i starts, (b) n_j
starts k cycles after n_i ends and n_i has static delay d, (c) n_j starts k cycles
after n_i ends and n_i's delay is not static

5.2 Modeling Multi-cycle Operations

Behavioral templates also help model complex multi-cycle operations.
When a single CDFG node is used to model a multi-cycle
operation, it imposes some limitations due to CDFG semantics:

• Execution cannot start until ALL inputs are available.

• ALL inputs must be held stable throughout operation execution. 

• ALL outputs are produced in the last cycle of execution. 

This makes it difficult to model, for example, a 3-cycle RAM write
operation where the address must be stable for the first two cycles,
the data must be stable in the second cycle, and the write sequence
finishes in the third cycle. To model such complex operations, we
differentiate between combinational and sequential multi-cycle
operations. A combinational operation has cycle synchronous
inputs and outputs, so it is modeled by a single CDFG node. A
sequential operation can have different cycle-by-cycle input/output
connections and even resource requirements, and is modeled by
several CDFG nodes that are locked into consecutive csteps by a
behavioral template. Fig. 5 shows the single-node and
multiple-nodes-in-a-template models for the above 3-cycle RAM write
operation. Note that in our model (Fig. 5(b)), the address and data inputs
are de-coupled in terms of when and for how long each input must
be stable. This also de-couples multiple outputs (if any).

FIGURE 5. Models for a 3-cycle RAM write operation: (a) single node
with delay=3; (b) 3 nodes locked in a template

This multiple-nodes-in-a-template model is even more powerful
when resource requirements are added. For example, a pipelined
operation uses different pipe-stages in different cycles, allowing
overlapping pipelined operations to share the same hardware module
as long as they do not have resource contention on any pipe-stage.
If we view pipe-stages as internal resources and assign each
pipe-stage a named "token", then we may label each node in the
template model with the resource tokens it requires. As a node is
scheduled to a cstep, we reserve its resource tokens for that cstep.
The number of conflicting tokens (i.e., number of non-mutually-exclusive
nodes that require the same token) in any cstep gives the
number of pipelined modules needed in that cstep. Overlapping of
pipelined operations can be scheduled on the same module because
successive nodes in the template model require different tokens.
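The token-counting rule can be illustrated with a short Python sketch. It assumes each template is given as a list of (offset, token) pairs and ignores mutual exclusivity between nodes, so every reservation counts; the function name and the three-stage example are illustrative.

```python
from collections import Counter

def modules_needed(scheduled):
    """Largest number of simultaneous reservations of any one token.

    scheduled: list of (start_cstep, template), where template is a
    list of (offset, token) pairs, one per node in the template.
    """
    demand = Counter()
    for start, template in scheduled:
        for offset, token in template:
            demand[(start + offset, token)] += 1
    return max(demand.values(), default=0)

# A basic 3-stage pipelined operation: one token per pipe-stage.
three_stage = [(0, "s1"), (1, "s2"), (2, "s3")]

overlapped = modules_needed([(0, three_stage), (1, three_stage)])
same_cstep = modules_needed([(0, three_stage), (0, three_stage)])
```

Two operations started one cstep apart never ask for the same stage in the same cstep and can share one module; started in the same cstep they collide on every stage and need two.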

This removes assumptions about pipelined operations from the
scheduling algorithms. We may now model operations on complicated
pipelines, and template-based scheduling will properly schedule
these operations on the pipelined modules. Fig. 6 shows
examples of operations on pipelines with internal feedback, sequential
inputs, and multiple outputs. We use "a[s1]" to denote an operation
named "a" which requires the token "s1".





FIGURE 6. Template models for: (a) basic 3-stage pipelined operation,
(b) 3-cycle pipelined operation with 2 stages and internal feedback, (c)
4-cycle pipelined operation with 2 stages and sequential inputs, (d)
pipelined operation using a different internal path and output port.

Actually, resource tokens need not correspond to physical hardware
resources, but may be considered a more general mechanism for
specifying how different types of operations can overlap in time on
the same module. Consider a 2-cycle RAM which has one read-port
and one write-port, whose read/write cycles must be synchronized.
Fig. 7 shows how we use resource tokens to specify this constraint
to scheduling. If two such operations are pre-assigned to the same
RAM, then resource contention on any of "s1", "s2", "s3" and "s4"
implies an illegal schedule. The token "s3" prevents read operations
for the same RAM to overlap; the token "s4" prevents write operations
for the same RAM to overlap; and the tokens "s1" and "s2"
prevent read and write operations for the same RAM from being
scheduled exactly one cycle apart.



FIGURE 7. Template models for RAM: (a) 2-cycle read, and (b) 2-cycle
write.

To handle multi-port RAM's, we allow a module to carry more than
1 copy of a given resource token. For example, to model a 4-port
RAM where each port can be used for both read or write, we would
define the RAM module to have 4 "r/w" tokens, and model read and
write operations on this RAM to require 1 "r/w" token each. This
would allow scheduling to perform up to 4 simultaneous read or
write operations on the same module.
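Allowing several copies of a token turns the contention check into a comparison against a per-token capacity. The following Python sketch shows the idea for the 4-port RAM example; the function name, the request format, and the default capacity of 1 for unlisted tokens are assumptions.

```python
from collections import Counter

def contention(requests, capacity):
    """Return the (cstep, token) pairs whose demand exceeds the number
    of copies of that token carried by the module.

    requests: list of (cstep, token); capacity: dict token -> copies.
    Tokens missing from `capacity` are assumed to have one copy.
    """
    demand = Counter(requests)
    return sorted(
        key for key, count in demand.items()
        if count > capacity.get(key[1], 1)
    )

ram_capacity = {"r/w": 4}                     # 4-port RAM, any port reads or writes
four_accesses = [(0, "r/w")] * 4
five_accesses = [(0, "r/w")] * 5

legal = contention(four_accesses, ram_capacity)
illegal = contention(five_accesses, ram_capacity)
```

Four accesses in one cstep fit the four "r/w" tokens; a fifth access in the same cstep is reported as contention.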

5.3 Pre-Chaining

Just before scheduling, we selectively force operation chaining by
locking operations in the same cycle using behavioral templates.
This "pre-chaining" step reduces scheduling complexity at the
expense of scheduling freedom. User-specified chaining directives
are applied in this step. We also implement automatic pre-chaining
for logic operations and bit-manipulation operations to save
registers.

Logic operations include bit-wise AND, OR, NOT, EXOR operations,
and bit-manipulation operations include bit-extract, bit-concatenate,
constant bit/word generator operations. These operations
are good candidates for pre-chaining because they have small
propagation delays and they are not resource shared. Thus pre-chaining
can be done on the basis of register costs alone. We implement a
greedy algorithm for pre-chaining:

1. In a forward traversal of the data flow graph, pre-chain a logic/bit-manipulation
operation with its predecessors if there are fewer output
bits than input bits;

2. In a reverse traversal of the data flow graph, pre-chain a logic/bit-manipulation
operation with its successors if there are fewer input
bits than output bits.

3. Iterate until there are no more changes.
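The three steps above can be sketched in Python as follows. This is an illustration only: operations are described by a hypothetical (input bits, output bits, is logic or bit-manipulation) triple, the dict is assumed to list operations in topological order, and chained operations are grouped with a small union-find structure instead of real behavioral templates.

```python
def pre_chain(ops, edges):
    """Greedy pre-chaining. Returns the chained groups as sorted lists.

    ops: dict name -> (in_bits, out_bits, chainable), in topological order.
    edges: list of (producer, consumer) data-flow edges.
    """
    parent = {name: name for name in ops}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return False
        parent[rb] = ra
        return True

    preds = {name: [] for name in ops}
    succs = {name: [] for name in ops}
    for src, dst in edges:
        succs[src].append(dst)
        preds[dst].append(src)

    order = list(ops)
    changed = True
    while changed:                            # step 3: iterate to a fixed point
        changed = False
        for name in order:                    # step 1: forward traversal
            in_bits, out_bits, chainable = ops[name]
            if chainable and out_bits < in_bits:
                for p in preds[name]:
                    changed = union(p, name) or changed
        for name in reversed(order):          # step 2: reverse traversal
            in_bits, out_bits, chainable = ops[name]
            if chainable and in_bits < out_bits:
                for s in succs[name]:
                    changed = union(name, s) or changed

    groups = {}
    for name in ops:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())

ops = {
    "add": (32, 16, False),   # arithmetic, not a pre-chaining candidate
    "ext": (16, 4, True),     # bit-extract: fewer output bits than input bits
    "zext": (4, 16, True),    # zero-extension: fewer input bits than output bits
    "mul": (32, 32, False),
}
groups = pre_chain(ops, [("add", "ext"), ("zext", "mul")])
```

Chaining the bit-extract with its producer means only the 4 extracted bits may need a register, and chaining the zero-extension with its consumer means only its 4 input bits do.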

Fig. 8 shows examples of good pre-chaining configurations. 




FIGURE 8. Pre-chaining examples: (a) constant with successor; (b)
zero-extension with successor; (c) bit-extract with predecessor; (d)
multi-input logic with predecessors

5.4 Hierarchical Scheduling

As shown in Fig. 3, our extracted CDFG is hierarchical, in which
each level of the hierarchy corresponds to a loop or a subroutine.
Hierarchical scheduling proceeds in a bottom-up traversal of the
hierarchy. At each level, instead of representing subgraphs as super
nodes, we inline each subgraph and use a behavioral template to
interlock the inlined nodes according to the subgraph's schedule.

Inlining subgraphs allows certain boundary optimizations. First,
unused subgraph outputs (and the operations that produce the
outputs) can be deleted. This deletion can recurse to unused subgraph
inputs and then to operations that feed these inputs. Second, inlining
subgraphs allows scheduling of neighboring nodes to take
advantage of when individual subgraph inputs/outputs are actually
required/produced, whereas representing subgraphs as super nodes
would force scheduling to assume that all subgraph inputs/outputs
are required/produced in the same cycles.

Another advantage of inlining subgraphs is that scheduling maintains
accurate cycle-by-cycle resource costs. This allows us, for example,
to calculate resource costs of scheduling operations in the first
and last cycles of a loop subgraph. (When an outside operation is
scheduled in the first/last cycle of a loop it is performed when
entering/exiting the loop.)

In fact, hierarchical scheduling is used to implement sequential 
multi-cycle operations. In the initial CDFG, each sequential opera- 
tion is a subroutine call to some library function, which is a
pre-scheduled CDFG whose nodes are labelled with the required
resource tokens. During inlining, each sequential operation is
replaced by an inlined copy of its function's CDFG, and a template
is created to lock these inlined nodes. This creates the
multiple-nodes-in-a-template model for sequential operations.

The disadvantage of inlining subgraphs is that more nodes are 
scheduled instead of a small number of super nodes. This is bal- 
anced somewhat by the fact that inlined nodes are grouped by templates
into a few scheduling units, so at least the scheduling
solution space is not much bigger.

6.0 Results 

Behavioral templates have been implemented in the Synopsys
Behavioral Compiler™ product. Behavioral Compiler™ inputs a
VHDL or Verilog behavioral description, performs scheduling,
allocation, module selection, binding, and control optimization, and
outputs an RTL design which is then optimized by RTL optimization
[2], FSM optimization, and logic synthesis.

We will use "dft", a discrete fourier transform design, to illustrate
behavioral templates. On reset, dft sequentially reads in the real and
imaginary parts of the coefficients into arrays cmem and dmem.
These arrays are mapped to the memory "CRAM" (cmem in the
lower bank, dmem in the upper bank). It then enters the main
processing loop. In each iteration, dft signals it is ready for processing,
does a busy wait for the "start" signal, and then sequentially reads in
the real and imaginary parts of the data points into arrays amem and
bmem, which are also mapped to two halves of a memory,
"DRAM". It then enters two nested FOR loops which compute the
discrete fourier transform values and write them out. The memories
are two-cycle RAM's whose read/write models are shown in Fig. 7.
The multiply operations are done on a 2-stage pipelined multiply.

FIGURE 9. Handshaking for start signal: (a) original CDFG with
timing constraints, (b) final CDFG scheduled

Fig. 9 shows the CDFG fragment for the busy-wait on the "start"
signal. Fig. 9(a) shows the initial CDFG containing the fixed
constraints that the "ready" signal be asserted one cycle before the busy
wait, and deasserted 0 cycle after the busy wait. Fig. 9(b) shows the
same CDFG fragment that is finally scheduled. By this time,
hierarchical scheduling has already scheduled the busy wait loop, and the
loop body is inlined and locked in a template. Also, pre-chaining
has locked the constants with the write operations.

However, the main scheduling problem is in the inner-most loop,
which reads from memories the complex data (a, jb), and coefficient
(c, jd), and computes psum += (a*c - b*d) and ipsum += (a*d
+ b*c). Table 1 shows the scheduling and allocation results for the
computation part of this loop. Note the pipelined RAM read's are
chained with the pipelined multiply operations.

FIGURE 10. Scheduling/allocation result for computation part of dft's
inner-most loop
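The accumulation performed by the inner-most loop is an ordinary complex multiply-accumulate: (a + jb)(c + jd) = (a*c - b*d) + j(a*d + b*c). The Python sketch below checks that identity against Python's built-in complex type; the function name and the sample data are invented for the example, and each product corresponds to one of the four multiplications the scheduler has to place on the pipelined multiplier.

```python
def dft_inner(a, b, c, d):
    """Accumulate the real and imaginary partial sums of
    sum((a_i + j*b_i) * (c_i + j*d_i)) using four real multiplies
    per data point."""
    psum = 0
    ipsum = 0
    for ai, bi, ci, di in zip(a, b, c, d):
        psum += ai * ci - bi * di
        ipsum += ai * di + bi * ci
    return psum, ipsum

a, b = [1, 2], [3, 4]        # real and imaginary parts of the data
c, d = [5, 6], [7, 8]        # real and imaginary parts of the coefficients
psum, ipsum = dft_inner(a, b, c, d)

# Reference result using complex arithmetic.
reference = sum(complex(x, y) * complex(u, v) for x, y, u, v in zip(a, b, c, d))
```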

In Table 2 and 3, we present some design statistics. #line is number
of VHDL/Verilog lines in the source. #loop is number of loops with
nesting levels in brackets. #node is number of CDFG nodes. #template
is total number of templates scheduled. The ratio of nodes to
templates is shown in brackets. #RAM is the number of on-chip
memories used. #gate is gate count after logic synthesis (excluding
RAM's). Note the difference between #node and #template.

Table 2 presents several HLSW benchmarks, modified to use more
realistic bit-widths. EWF is the fifth order elliptic wave filter example
(20 csteps for the main loop, using 1 16x16 pipelined multiply,
2 32-bit adders, 10 32-bit registers and 1 16-bit register). KF is the
Kalman filter modified to use S RAM's.

Table 3 lists several industrial examples. Compared to benchmark
examples, these designs tend to:

• have more complicated reset sequences before the main
processing loop

• have more cycle-by-cycle timing constraints on IO operations

• have more logic operations and bit-manipulation operations

• use multiple RAM's or multi-port RAM's to improve RAM
access bottlenecks.

• use pipelined operations to increase throughput

7.0 Conclusion

In this paper, we have presented our work on scheduling using
behavioral templates. The most important value of behavioral templates
is that they enable simple solutions to the problems of (1)
enforcing fixed and maximum timing constraints, (2) modeling
complex sequential operations, (3) pre-chaining of logic and
bit-manipulation operations, and (4) hierarchical scheduling. For this
reason, behavioral templates have been instrumental in our
productization of behavioral synthesis.

Future work will investigate adding structural templates to partition
the design based on structural regularity.
Abstract

This paper describes an HDL synthesis based design methodology that
supports user adoption of behavioral-level synthesis into normal
design practices. The use of these techniques increases understanding
of the HDL descriptions before synthesis, and makes the comparison
of pre- and post-synthesis design behavior through simulation much
more direct. This increases user confidence that the specification does
what the user wants, i.e. that the synthesized design matches the
specification in the ways that are important to the user. At the same time,
the methodology gives the user a powerful set of tools to specify
complex interface timing, while preserving a user's ability to delegate
decision-making authority to software in those cases where the user
does not wish to restrict the options available to the synthesis
algorithms.

1.0 Overview 

This paper describes a synthesis methodology that uses high-level
synthesis (HLS) of behavioral hardware-description language (HDL)
descriptions. HLS has the distinguishing characteristic that operations
are automatically scheduled, i.e. assigned to states, as opposed to
lower-level synthesis, in which operations are assigned to states by the
user [1, 2, 3]. For example, in an HDL description of a square root
function, an operand x would be loaded, a series of operations would
follow, and a single result r would be returned. The read of x and the
write of r might be fixed to particular states or times by a communication
protocol, but the internal operations that compute the square root
would be automatically scheduled.
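
As a rough illustration of what "automatically scheduled" means, the following Python sketch assigns each operation of a small dataflow graph to the earliest state in which all of its inputs are available (as-soon-as-possible scheduling). The function name asap_schedule, the operation names, and the assumption that every operation takes exactly one state are illustrative only; they are not taken from the methodology described here, and a production scheduler would also account for resources and timing constraints.

```python
def asap_schedule(deps):
    """Assign each operation to the earliest state after all of its inputs.

    deps maps an operation name to the list of operations it consumes.
    Assumption for illustration: every operation takes exactly one state.
    """
    state = {}

    def visit(op):
        if op not in state:
            # An operation with no inputs lands in state 0.
            state[op] = 1 + max((visit(p) for p in deps[op]), default=-1)
        return state[op]

    for op in deps:
        visit(op)
    return state


# Hypothetical square-root-like dataflow: read x, two dependent steps, write r.
deps = {
    "read_x": [],
    "step1": ["read_x"],
    "step2": ["step1"],
    "write_r": ["step2"],
}
schedule = asap_schedule(deps)
```

In this sketch only the dependencies fix the order; the read and the write could additionally be pinned to particular states by a protocol, which is the subject of the modes described below.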

A user will then ask a number of questions. These
will likely include the following:

• How can I constrain I/O operations to fall into particular cycles or
range of cycles, to meet existing protocols?

• How can I constrain I/O operations to have particular timing
relationships? For example, how can I constrain data strobes to
be synchronous with data on data ports?

• How can I be confident that my interface timing specification
really works with the surrounding hardware?

• [...] timing specification is not rigid? For example, [...]

1. In the sense that it computes the right result,

2. In the sense that scheduling of I/O operations does not
'break' its I/O protocols.

These questions can be reformulated as requirements on the HDL
description methodology to be used in conjunction with HLS:

• The original HDL description should be simulatable.

• There should be a mode wherein the cycle by cycle I/O timing
of the original HDL description is preserved exactly; i.e.
no cycle-level difference will be allowed between [...].
This will allow direct comparison, on a cycle by cycle basis,
of the pre- and post-synthesis [...] the user to meet the most [...]
cycle-based timing protocols.

• There should be a mode wherein timing relationships
between I/O signals can be simply and easily preserved
across synthesis, but where 'stretching' (cycle level delay
insertion) is permitted, so that the user does not have to specify
[...] a computation will take. This
mode should allow manual constraints. Such a mode allows
comparison of pre- and post-synthesis I/O timing between
similar points of the pre- and post-synthesis waveforms.

• There should be a mode in which the user explicitly specifies
all timing constraints without reference to the simulation
[...] from the HDL description are ordering constraints among I/O
[...]. This mode gives the greatest [...]
relationships; it is also the most difficult to use.

We call these three modes the cycle-fixed I/O scheduling mode,
the superstate-fixed I/O scheduling mode, and the free-floating I/O
scheduling mode respectively. Each has consequences for the
style of HDL description and validation methodology. These
modes give the user a wide range of choices in specifying I/O
timing, with a corresponding range of ways in which validation
of the specification and comparison of the implementation with
the specification can be performed.

1.1 Structure of this paper

The balance of this paper is structured as follows. In Section 1.2
related work in this field is discussed. Following that, in Section
2, some mode-independent considerations and assumptions are
described. In Section 3, the cycle-fixed mode is described in
detail. Then in Section 4, the superstate-fixed mode is described.
In Section 5, the free-floating mode is described. In Section 6,
experience with the current software is described; finally, in
Section 7 the paper is summarized and conclusions are drawn.

1.2 Related Work 

High-level synthesis has been well described in the literature;
see, for example, Camposano [1], Gajski [2], and [3]. These
tutorial papers describe the basics of HLS systems. CALLAS [4]
describes work in the area of maintaining simulated behavior that
is exactly the same pre- and post-synthesis; this idea is reflected
in the cycle-fixed mode described here. The superstate-fixed mode
is related to the High Level State machine of [5], and to the
behavioral finite state machines (BFSMs) of [6]. Our approach of
validation through simulation is typical of current industry practice; it
complements, but cannot completely replace, more formal methods.

2.0 Basic assumptions 

The circuit to be synthesized by HLS consists of a collection of 
always blocks (VHDL processes); each always block will be 
mapped to hardware consisting of a datapath and a control FSM. 
Each will be synthesized separately. 

Control over timing makes use of clocking statements in the source
HDL. In Verilog, this can be done by use of @(posedge clock) or
@(negedge clock) statements. These are used to separate I/O
events that are to happen in different clock cycles. Event triggers
using other signals are specifically disallowed, with the exception
of asynchronous reset and a special gating methodology described
in Section 12, used for synchronizing I/O.

2.1 Reset 

In order to handle resets in an intuitively appealing way, we call
attention to the always block (VHDL process) that will be scheduled.
In our methodology this block contains a single all-encompassing,
nonterminating loop, here called reset_loop.

    always begin: bl
        begin: reset_loop
            // reset sequence behaviors
            forever begin
                // normal mode behaviors
            end
        end
    end

Inside reset_loop is a reset sequence; this consists of all behaviors
associated with reset. For example, in a microprocessor the reset
sequence would clear the program counter, disable interrupts, and
initialize the stack pointer. The reset sequence may contain many
clock cycles, e.g. to initialize a RAM. Following the reset behavior
is the 'normal mode' loop, which does not terminate either; this
loop contains behaviors that are executed until the next reset
occurs. In a microprocessor, for example, the normal mode loop
would be the fetch / execute cycle.

In order to simulate the effect of synchronous resets correctly in the
source HDL description, the user must insert a statement of the
form (see footnote 2)

    if (reset == 1'b1) disable reset_loop;

after every @(posedge) statement. This disable has the effect of
restarting the block (process) following a clock edge upon which
reset is found to be true. Simulation of synchronous resets can be
matched both pre- and post-synthesis.
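
The restart behavior can be mimicked in a few lines of Python. This is only a behavioral sketch under stated assumptions: the function name run, the phase names "reset_seq" and "normal", and the representation of the reset pin as a list of per-edge samples are all hypothetical. Each sample stands for one clock edge, and a true sample plays the role of the disable statement.

```python
def run(reset_trace):
    """Model reset_loop over len(reset_trace) clock edges.

    reset_trace[i] is the (hypothetical) sampled value of the reset pin at
    clock edge i. Returns the sequence of phases that were executed.
    """
    log = []
    i = 0
    n = len(reset_trace)
    while i < n:                        # each pass models (re)entering reset_loop
        restarted = False
        log.append("reset_seq")         # reset sequence behaviors
        while i < n and not restarted:  # the 'forever' normal-mode loop
            reset = reset_trace[i]      # models @(posedge clock)
            i += 1
            if reset:                   # models: if (reset == 1'b1) disable reset_loop;
                restarted = True
            else:
                log.append("normal")    # normal mode behaviors
    return log


phases = run([0, 0, 1, 0])
```

A reset sampled at the third edge abandons the normal-mode loop and re-runs the reset sequence, after which normal-mode behavior resumes.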

Another capability can also be provided in which the user declares a
reset pin to the synthesis software, which then synthesizes the reset;
but because the reset behavior is not encoded in the HDL, resets
cannot be simulated correctly before synthesis using this technique.

Scheduling cannot handle exits triggered by a reset in the same way
as other exits, because there may be read-before-write accesses in
the HDL. Consider the following:

    begin: reset_loop
        outport <= x;    // x is read before write!
        begin: [...]
            @(posedge clock);
            if (reset == 1'b1) disable reset_loop;
            [...]
        end
    end

In this situation, the assignments of x cannot be rescheduled, because
this would change the observable behavior of the circuit immediately
following a reset pulse. If, for example, the second write to x was
rescheduled before the clock edge, then the output immediately
following a reset pulse would be v2 in the scheduled design; but it
would be v1 in the original description. So if we are to allow read
before write in the HDL, we must either relax the requirement that all
behaviors must be identical, or we must forbid movement of such side
effects across clock boundaries. Side effects on variables that are
always written before they are read are not affected.

2. In VHDL this would be "when reset = '1' exit reset_loop".

2.2 Registered outputs 

VHDL signals and Verilog reg variables behave like register or
latch outputs. That is, they hold their values once set. For
implementation reasons, we chose to register all outputs of HLS
synthesized designs; thus a nonblocking (signal) assignment becomes a
register write. This has the consequence that responses to external
events cannot happen until the cycle after the external event, as
shown in Fig. 1.

Figure 1 shows the behavior of a synthesized circuit where the HDL
input is of the general form

    if (Ready == 1'b1) Data <= foo;
    @(posedge clock);

This timing corresponds to both input and output. Notice that this
timing diagram implies that the control FSM for the synthesized
data path is a Mealy machine; and that the overall synthesized
design is a Moore machine.

[Figure: waveforms of Clock, Ready, and Data]
Fig. 1. Response to an external event

Here is an example combining an asynchronous reset and a compact
busy wait on a data strobe.

    while (strobe != 1) begin
        @(posedge clock or posedge reset);
        if (reset == 1'b1) disable reset_loop;
    end

3.0 Cycle-fixed mode 

High-level synthesis in cycle-fixed mode can be described by the
following statement:

• Cycle-by-cycle I/O timing is identical between the pre- and
post-synthesis designs.

This means that validation by simulation is straightforward: a user
need merely simulate the pre- and post-synthesis designs side by
side, and check for differences in the outputs. Alternatively, the
synthesized design can be inserted into the original test bench
without modifying the test bench. The only differences that are visible
involve combinational delays in the form of setup and hold times;
for example, a delta-delay setup time would become a real setup
time, and a registered output pin will not transition exactly on the
clock edge, as it would in the pre-synthesis simulation (see
footnote 1). This is shown in Fig. 2.

Atty. Docket No. 4000/10
Inventor: Tai A. Ly et al.
Title: METHODS FOR AUTOMATICALLY PIPELINING LOOPS

Jonathan T. Kaplan, Esq.
Registration No. 38,935
Brown Raysman Millstein Felder & Steiner LLP
Attorney for Applicants
120 West Forty-Fifth Street
New York, New York 10036
Phone: (212) 944-1515 Fax: (212) 840-2429

APPENDIX B
Sheet 3 of 7

Notice that this mode only constrains the I/O operations of the
design. That is, the reads and nonblocking (signal) writes of the
HDL are tied to particular cycles. But this still leaves optimization
opportunities for the scheduling algorithm: other operations (e.g.
additions, memory operations, and register reads and writes) can be
shifted in time, as long as they consume data after it has been read
in, and produce data in time to write it out. The I/O operations
provide a series of 'stakes in the ground' that define time frames within
which all other operations are free to move.
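
The 'stakes in the ground' idea can be made concrete with a small Python sketch that computes the time frame of one internal operation between a fixed read and a fixed write. The function name time_frame and the timing model are assumptions made for illustration: the operation occupies cycles s through s + latency - 1, may start no earlier than the cycle of the read, and must finish no later than the cycle of the write.

```python
def time_frame(read_cycle, write_cycle, latency=1):
    """Return the legal start cycles for an internal operation.

    Assumed model (not taken from the source): an operation started in
    cycle s occupies cycles s .. s + latency - 1; it must not start before
    read_cycle and must finish no later than write_cycle.
    """
    last_start = write_cycle - latency + 1
    return list(range(read_cycle, last_start + 1))


# A two-cycle operation between a read fixed in cycle 2 and a write in cycle 6.
frame = time_frame(2, 6, latency=2)
```

An empty result means the I/O stakes leave no room for the operation, which in cycle-fixed mode would have to be resolved by editing the clock edge statements in the HDL.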

[Figure: waveforms of Clock, Strobe, and Data]
Fig. 2a. Simulation of specified design (pre-synthesis)
Fig. 2b. Simulation of synthesized design (post-synthesis)
Fig. 2. Comparison of simulation in cycle-fixed mode.

The main advantage of cycle-fixed mode is that the user can synthesize
exactly the same timing diagram that the original HDL specification
shows in simulation; thus, if the simulated HDL specification
works in a particular context, then the synthesized design will also
work, assuming only that setup, hold, and propagation delays, etc.
as shown in Fig. 2b meet the clock cycle time.

A further advantage of cycle-fixed mode is that simulation of a
zero-gate-delay model of the synthesized design will match the
original specification exactly; hence a simple file difference
program can be used to compare pre- and post-synthesis designs. This
is expected to have a profound effect on user acceptance of HLS as
a viable tool in the design cycle: users are able to simply and
efficiently check the equivalence of designs before and after synthesis.

There are a number of methodological and implementation
considerations that affect the way we can write and implement cycle-fixed
mode. These will now be described.
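
The file-difference style of comparison can be sketched in Python as follows. The trace format is an assumption for illustration: each trace is a list with one entry per clock cycle, holding the output values sampled in that cycle. The function name first_mismatch is likewise hypothetical.

```python
def first_mismatch(pre, post):
    """Return the index of the first cycle in which two traces differ.

    pre and post are per-cycle output samples from the pre- and
    post-synthesis simulations (an assumed format). Returns None if the
    traces are identical.
    """
    for cycle, (a, b) in enumerate(zip(pre, post)):
        if a != b:
            return cycle
    if len(pre) != len(post):
        # One trace is a strict prefix of the other.
        return min(len(pre), len(post))
    return None


pre_trace = [("x", 0), ("x", 1), ("x", 2)]
same = first_mismatch(pre_trace, list(pre_trace))
diff = first_mismatch(pre_trace, [("x", 0), ("x", 9), ("x", 2)])
```

In cycle-fixed mode a result of None is the expected outcome for a zero-gate-delay model; any other value points at the first cycle to inspect.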

3.1 Numbers of clock edges 

One consequence of the commitment to maintain exact I/O
equivalence in cycle-fixed mode is that numbers of clock edges cannot be
varied inside the scope of loops and conditionals. To do so would
distort the I/O timing of the design.



1. In zero-delay simulation one should ensure that data transitions
occur slightly after clock transitions; failing to do this is the most
common source of simulation mismatches. The problem comes
about because of varying numbers of simulation-cycle delays in the
clock and data wires of the circuit: the clock can arrive 'after' the
data by an infinitesimal (zero-time) amount. This causes something
analogous to a setup-time violation.



3.2 Loop boundaries 

Every loop of an always block must contain at least one clock edge
statement. The only exception to this is loops with constant
iteration bounds, which can be unrolled during synthesis.

A loop can be thought of as a subgraph of a finite-state machine
(FSM) which forms a cycle. The synthesized design will enter this
cycle when the loop is executed, and leave it when the loop is
exited. Such a loop is shown in Fig. 3.

    o1 <= v1;
    while (c) begin: loop
        @(posedge clock);
        o2 <= v2;
    end
    o3 <= v3;
    @(posedge clock);

[State graph: the transition labeled !c / v1, v3 bypasses the state 'Loop']
Fig. 3. Loop and corresponding state graph



The loop of Fig. 3 corresponds to the state labeled 'Loop'. During
each pass of the loop, the value of v2 will be written to the output
port o2.

The main consequence of matching this behavior is the splitting of
the conditional test c. Notice that it was necessary, in order to
capture the timing of the original, to have a state transition that
bypassed the loop altogether if c was false when it was first tested.
This means that the test must be performed in two places: once in
state prev, and once in state loop. In general, it is necessary to unroll
the first state of the first pass through a while loop in order to
capture this behavior correctly.
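
Why the test must be split can be seen by executing the loop of Fig. 3 in a small behavioral model. In the Python sketch below, a generator stands for the always block and each yield stands for a clock edge statement; the names simulate and fig3_loop, and the modeling of the condition c as a list of sampled values, are assumptions made for illustration.

```python
def simulate(process, cycles):
    """Run a generator-based process for a number of clock cycles.

    Each `yield` in the process models @(posedge clock). Returns, for each
    cycle, the list of port writes issued before that clock edge.
    """
    writes = []
    proc = process(writes)
    trace = []
    for _ in range(cycles):
        next(proc, None)            # run up to the next clock edge
        trace.append(list(writes))
        writes.clear()
    return trace


def fig3_loop(c_values):
    """Model of the Fig. 3 fragment; c_values are successive samples of c."""
    def process(writes):
        c = iter(c_values)
        writes.append("o1")         # o1 <= v1;
        while next(c, False):       # while (c)
            yield                   #     @(posedge clock);
            writes.append("o2")     #     o2 <= v2;
        writes.append("o3")         # o3 <= v3;
        yield                       # @(posedge clock);
    return process


bypass = simulate(fig3_loop([False]), 1)
one_pass = simulate(fig3_loop([True, False]), 2)
```

When c is false at the first test, the writes to o1 and o3 fall in the same cycle, which is the bypass transition of the state graph; when c is true once, o1 is written alone and o2 and o3 share the following cycle.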

If we wish to avoid unrolling the first pass, then it is necessary to
rewrite the loop so that (1) there is a clock edge on all paths
between the writes of o1 and o3, and (2) there is a clock edge
between the conditional test and any succeeding I/O, as shown in
Fig. 4.

    o1 <= v1;
    while (c) begin: loop
        @(posedge clock);
        o2 <= v2;
    end
    @(posedge clock);
    o3 <= v3;

Fig. 4. Loop that does not need partial unrolling.

3.3 Conditional multicycle operations 

A multicycle operation is one that has a longer combinational delay
than the clock cycle. This imposes special constraints on synthesis
in cycle-fixed mode, because it is necessary to stabilize all data and
control inputs to the hardware block that implements the multicycle
operation. This includes all the control inputs of all multiplexers
that drive multicycle operations; clearly we cannot afford glitches
on these paths.

But inserting these registers means that we need to know what to
strobe into the registers one cycle before the multicycle operation
is to begin. Thus we need to add extra time, under some
circumstances, so that the stabilizing registers can be properly loaded.
This is illustrated in Fig. 5; we assume

    @(posedge clock);
    if (input_signal == 1'b1) begin
        x = input_read1;
        y = input_read2;
        tmp = x + y;         // 2 cycle addition
        @(posedge clock);    // strobe stab regs
        @(posedge clock);    // 1st cycle of add
        @(posedge clock);    // 2nd cycle of add
        out <= tmp;
    end
    @(posedge clock);

Fig. 5. HDL description for a multicycle addition.

Notice that we needed three clock cycles to do this properly: one
to get the condition and strobe the stabilizing registers, and two to
perform the multicycle addition. Notice also that such delays can
often be hidden, where the multicycle operations are not
constrained by I/O; but that in this case there is no opportunity to
hide the additional delay associated with stabilizing the inputs.
3.4 Loop pipelining in cycle-fixed mode 

Loop pipelining is a technique whereby a loop can be made to act
like a pipeline. Thus the loop has a relatively long latency, i.e. the
time from a data input to the corresponding data output; and a
shorter initiation interval, which is the rate at which data can be
delivered to and read out from the loop. In cycle-fixed mode, and
with some extra constraints in the other modes, a simple way to
imply loop pipelining while maintaining timing equivalence is to
use a delayed assignment (in VHDL, a transport delay) on the
output statement. Suppose, for example, we have a loop whose
latency is ten cycles, but whose initiation interval is two cycles;
we can put an output write after the second clock edge statement,
with a delay of eight cycles. This will simulate the same way both
before and after synthesis.

    while (condition) begin
        @(posedge clock);    // 10 ns clock
        @(posedge clock);
        out <= #80 value;    // delayed by 8 cycles
    end
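
The delay value in the fragment above follows from simple arithmetic: the output write sits at the end of the initiation interval, so the remaining latency has to be supplied by the delayed assignment. A minimal Python sketch of that calculation, with the hypothetical helper name output_delay_ns, is:

```python
def output_delay_ns(latency_cycles, initiation_interval, clock_period_ns):
    """Delay to place on the output write of a pipelined loop.

    The write is assumed to follow the last clock edge of the initiation
    interval, so the delayed assignment covers the remaining cycles.
    """
    remaining = latency_cycles - initiation_interval
    if remaining < 0:
        raise ValueError("latency must be at least the initiation interval")
    return remaining * clock_period_ns


# Latency of ten cycles, initiation interval of two cycles, 10 ns clock.
delay = output_delay_ns(10, 2, 10)
```

With the numbers used in the text this gives 80 ns, i.e. eight cycles of a 10 ns clock.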



4.0 Superstate-fixed Mode

The superstate-fixed I/O mode is used where the I/O should
inherit its general structure from the HDL, but where there is
some freedom to shift I/O operations in time. Consider, for
example, the two-wire handshaking protocol shown in Fig. 6.
The two-wire protocol is insensitive to the time between
transitions; this makes it ideal for many applications. In a case like this,
the only things we really need to assure in order to have correct
timing are that (1) the signal transitions occur in the right order
and (2) that the transitions of Strobe and Data maintain a lockstep
relationship. Beyond that, the user might not care very much how
many clock cycles were inserted by scheduling; other design
optimization criteria (such as the number of gates to compute the data
value) might dictate more or fewer clock cycles for this
transaction. The cycle-fixed mode is unsuitable for this kind of loosened
specification of timing: the user could be forced to edit the code
many times, with varying numbers of clock edge statements each
time, looking for the best implementation.

[Figure: waveforms of Request, Strobe, and Data]
Fig. 6. Two-wire handshaking protocol.

The superstate-fixed I/O scheduling mode can be expressed by the
following statements:

• Adjacent pairs of clock edge statements in the HDL form the
boundaries of superstates.

• All I/O operations in a superstate remain in that superstate.

• A superstate may be expanded by the scheduler, which can add
clock cycles to lengthen a superstate.

• All I/O writes in a superstate will always take place in the last
clock cycle of the superstate.

• I/O reads may float within a superstate.

[Figure: waveforms across superstates ss1, ss2, ss3]
Fig. 7a. Simulation before superstate-fixed scheduling.
Fig. 7b. Simulation after superstate-fixed scheduling.

These rules, taken together, mean that an HDL scheduled in superstate
mode will show the same signal transitions and ordering as the
original HDL; but that the original timing may potentially be 'stretched' by
the addition of new clock edges. This is illustrated in Fig. 7, where the
original HDL simulation of an I/O transfer taking three cycles has
become five cycles long by the addition of two extra cycles to the
second superstate.
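
The superstate-fixed rules on writes can be sketched in Python as follows. The representation is an assumption for illustration: each superstate is given as the list of writes it contains, and the scheduler's result is given as a list of superstate lengths in cycles. The function name write_cycles and the signal names are hypothetical.

```python
def write_cycles(superstates, lengths):
    """Place every write in the last cycle of its (possibly stretched) superstate.

    superstates: list of lists of write names, in HDL order.
    lengths: number of clock cycles the scheduler gave each superstate.
    Returns a dict mapping each write to its absolute cycle number.
    """
    result = {}
    start = 0
    for writes, n in zip(superstates, lengths):
        for w in writes:
            result[w] = start + n - 1   # writes land in the last cycle
        start += n
    return result


# Three superstates; the second is stretched by two extra cycles.
states = [["request"], ["strobe", "data"], ["done"]]
before = write_cycles(states, [1, 1, 1])
after = write_cycles(states, [1, 3, 1])
```

The transfer grows from three cycles to five, while the order of the writes is unchanged and the strobe and data writes stay in the same cycle as each other.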



4.1 Protocols in superstate mode

One of the major advantages of superstate mode is that handshaking I/O
protocols are not distorted by the addition of clock cycles to
superstates. This has two beneficial consequences: first, comparison of
simulated pre- and post-synthesis designs is straightforward; and second,
protocols that are insensitive to increased numbers of clock cycles will
not be 'broken' by superstate scheduling. Hence if a design consists of
many processes, each of which is to be scheduled, the use of
handshaking communication in conjunction with superstate mode
scheduling will ensure that the design will continue to work after synthesis.
The same considerations apply to the simulation test bench as well:
the test bench must communicate with the synthesized design(s) via
handshaking protocols; otherwise it may have to be modified to
communicate successfully with the synthesized design. This happens
because the read and write operations occur at different times pre- and
post-synthesis; the test bench must be able to tolerate this, or the user
will have to retime the test bench.

Protocols that do not involve explicit requests and acknowledges can
still be used; but care must be taken with data to be read in by the
synthesized process. In particular, recall that read operations may move
freely within their superstate. This means that data being presented
to the synthesized circuit must be either valid during the entire
superstate in which it is read, or else retimed after scheduling. This
will ensure that the read operation always gets the correct data.

4.2 Constraints in superstate mode

The reason a designer would use superstate mode instead of
cycle-fixed mode is that some part of the schedule does not have a fixed
timing bound, and the user does not want to imply such a bound by
using cycle-fixed I/O. However, the user may have a
non-handshaking protocol, or a protocol that streams data once synchronization
has been established by the protocol. In such cases the parts of the
schedule that perform synchronization may need to be handled as if
the scheduler was in cycle-fixed mode; while the other parts of the
design can be allowed more freedom. For example, consider the
fragment

    while (ready == 1'b0) begin: handshaking_loop
        @(posedge clock);
    end
    @(posedge clock);
    a1 = in_port;    // label read_1
    @(posedge clock);
    a2 = in_port;    // label read_2
    @(posedge clock);
    out_port <= long_involved_function(a1, a2);
    out_ready <= 1'b1;    // label done
    @(posedge clock);

Here the external logic provides the data for read_1 and read_2 in
the two cycles after the signal ready goes true; the synthesized
system must pick it up then, or the protocol will be broken.
Furthermore, insertion of extra cycles in the loop handshaking_loop will
cause the interface to behave unpredictably. Thus cycle-fixed mode
would seem to be indicated. However, suppose that there is no need
for the output to show up until 20 cycles after the input has been
delivered; the designer will thus want to allow the scheduler
authority to add cycles to the last superstate, and rely on a test of the
out_ready pin to synchronize the data on out_port. Thus stretching can
be allowed in the last superstate, but not in the first three.

This can be done by means of explicit point-to-point scheduling
constraints; that is, constraints that tie two labeled operations
together in a particular timing relationship. A constraint set that
would serve the purpose is

1. The time from the beginning of handshaking_loop to its end
should be exactly one cycle.

2. The time from the end of handshaking_loop to the beginning of
read_1 should be exactly one cycle.

3. The time from the end of handshaking_loop to the data ready
strobe done is no greater than 21 cycles.

Notice that these constraints are not part of the HDL; but they are a
necessary part of the methodology. They can be implemented as
pseudo-comments, as attributes, or as directives in a separate
scheduler command file. Notice also that they can be applied to non-I/O
operations as well, in all three modes, to give the user a little extra
control over the scheduling process.
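
A point-to-point constraint set of this kind is easy to check against a candidate schedule. The Python sketch below is illustrative only: the tuple format (source label, destination label, kind, cycles), the kinds "exact" and "max", the label names loop_begin and loop_end, and the function name violations are assumptions, and do not reflect the syntax of any particular scheduler command file.

```python
def violations(schedule, constraints):
    """Return the constraints that a schedule fails to satisfy.

    schedule: dict mapping an operation label to its cycle number.
    constraints: iterable of (src, dst, kind, n), where kind is "exact"
    (dst is exactly n cycles after src) or "max" (dst is between 0 and n
    cycles after src).
    """
    failed = []
    for src, dst, kind, n in constraints:
        distance = schedule[dst] - schedule[src]
        if kind == "exact":
            ok = distance == n
        elif kind == "max":
            ok = 0 <= distance <= n
        else:
            raise ValueError("unknown constraint kind: " + kind)
        if not ok:
            failed.append((src, dst, kind, n))
    return failed


# The three constraints listed above, with hypothetical label names.
constraints = [
    ("loop_begin", "loop_end", "exact", 1),
    ("loop_end", "read_1", "exact", 1),
    ("loop_end", "done", "max", 21),
]
good = violations({"loop_begin": 0, "loop_end": 1, "read_1": 2, "done": 20}, constraints)
bad = violations({"loop_begin": 0, "loop_end": 1, "read_1": 3, "done": 30}, constraints)
```

The first schedule satisfies all three constraints; the second stretches both the read and the done strobe too far and is reported twice.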

4.3 Superstate HDL methodology

Superstate mode defines superstates as containing the I/O
operations that fall between adjacent pairs of clock edge statements. This
definition has the consequence that sometimes an HDL prepared for
superstate mode needs clock edge statements that are not needed in
cycle-fixed mode. For example, the text of Fig. 3 is ambiguous
when the HDL is considered as input for superstate mode. This
comes about because two writes are separated by a conditional
@posedge. If the loop condition is true, then the writes should be in
different superstates; if it is false, then they should be in the same
superstate. Clearly there is no unique static assignment of
operations to superstates in this situation.

Furthermore, there is an implicit ordering of operations conferred
by the sequencing of the HDL text; this ordering cannot be allowed
to come into conflict with the ordering conferred by the migration
of reads into any cycle of their superstate and writes into the last
cycle of their superstate.

The HDL methodology rules that prevent ambiguities and
contradictions in superstate mode are:

1. A superstate that contains a loop continue is called a continuing
superstate. Implicitly, the last superstate of a loop is also a
continuing superstate. A continuing superstate and the first
superstate of the loop are really the same superstate; there is no clock
statement on the execution path going from one to the other. If a
continuing superstate contains a write, then the first state of the
loop cannot contain any I/O, because a write belonging to the
continuing superstate would be migrated to the end of the first
loop superstate: this would result in a violation of the HDL's
ordering constraints.

2. A superstate that contains a loop beginning cannot include both
an I/O write before the loop beginning and any I/O operation
inside the loop. For example,

    @(posedge clock);
    out_port <= write1_data;
    while (cond) begin
        read1_data = in_port;    // Illegal!
        @(posedge clock);
    end

the write in this fragment conflicts with the read in the
beginning of the loop; they are in the same superstate.

3. A write cannot precede a while loop that is succeeded by any
I/O operation, unless there is a clock edge statement between
either the write and the loop begin, or between the loop end and
the second I/O operation.

4. A loop having a superstate in which both a loop exit (see
footnote 1) and an I/O write are located must have a clock edge
statement between the loop end and the next I/O operation.

5. A conditional clock edge (e.g. an @(posedge) on one branch of a
conditional) cannot be used to separate a write from another I/O
operation. This fragment is illegal for that reason.

    out_port <= v1;
    if (cond) @(posedge clock);
    v2 = in_port;

5.0 Free-floating I/O mode 

It will sometimes be the case that a user will need to convey more
freedom to the scheduler than is allowed by the superstate I/O
mode. For example, the user may wish to allow two unrelated
writes to be permuted. Consider the fragment of Fig. 8.
In this situation, the user might not care whether the first or the
second function happens first; indeed, they could be interleaved and
the user might not care. But neither superstate nor cycle-fixed mode
will permit permutation of I/O operations and waits; so a more
powerful mode is needed.



1. Other than a reset exit. Reset exits can be ignored after a
preprocessing step in which they are detected and global reset behavior is
enacted, as explained in Section 2.

    a1 = in_port1;
    a2 = in_port2;
    @(posedge clock);
    out_port_1 <= long_function_1(a1, a2);
    @(posedge clock);
    b1 = in_port3;
    b2 = in_port4;
    @(posedge clock);
    out_port_2 <= long_function_2(b1, b2);

Fig. 8. Writes to out_port_1 and out_port_2 may be permuted.

The free-floating mode is characterized by implicit constraints on
single I/O ports and explicit user constraints.
Implicit I/O port constraints are derived directly from the HDL text
and are imposed on the sets of reads and writes that occur on a
single port. These are formed into partially ordered sets, one for each
port, where the ordering is derived from a static execution trace
analysis of the source HDL. The schedule constructed by synthesis
can only transpose two members of one of these sets if there is no
ordering relationship between them.
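
The implicit per-port constraint can be sketched as a check on a proposed schedule. In this Python sketch the partial order of each port is given directly as a list of (first, second) pairs; how those pairs are derived from the HDL (the static execution trace analysis) is outside the sketch. The function name illegal_transpositions and the operation labels are hypothetical.

```python
def illegal_transpositions(port_orders, schedule):
    """Find ordered pairs of operations on one port that a schedule reverses.

    port_orders: dict mapping a port name to a list of (first, second)
    pairs, meaning `first` must be scheduled strictly before `second`.
    schedule: dict mapping an operation label to its cycle number.
    """
    bad = []
    for port, pairs in port_orders.items():
        for first, second in pairs:
            if schedule[first] >= schedule[second]:
                bad.append((port, first, second))
    return bad


# Two reads on one port are ordered; writes on different ports are not.
orders = {"in_port": [("read_1", "read_2")], "out_port_1": [], "out_port_2": []}
permuted_writes = {"read_1": 0, "read_2": 1, "write_2": 2, "write_1": 3}
swapped_reads = {"read_1": 1, "read_2": 0, "write_1": 2, "write_2": 3}
ok = illegal_transpositions(orders, permuted_writes)
not_ok = illegal_transpositions(orders, swapped_reads)
```

Permuting the writes on the two different output ports raises no objection, whereas swapping the two reads on the shared input port is reported.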

This, however, says nothing about ordering of reads and writes that
occur on different ports, which must be explicitly constrained by
the user, by means of the explicit two-point constraints described in
Section 4.2.

For example, in our experience a common early mistake in
free-floating mode is to expect a data strobe's timing to be fixed with
respect to that of the data being strobed. This will not necessarily be
the case if the user does not issue explicit constraints.

The downside of this mode is the number of explicit constraints that
the user must construct. This can easily be comparable in numbers
of lines to the HDL input itself. In addition, it is very easy to get
such constraints wrong, or to forget a crucial constraint; hence the
cycle-fixed and superstate modes are simpler and less error-prone to
use.

6.0 Experience 



Support for the methodologies discussed above has been built into a
commercial product, the Synopsys Behavioral Compiler(TM). This
product is currently in use at a number of sites. Of these, about half
use Verilog as their input HDL; the rest use VHDL.

Experience to date indicates that the superstate mode is usually the
most convenient from the standpoint of ease of specification of
complex timing behaviors. The next most convenient is usually the
cycle-fixed mode. The reason for this is that the power of the
free-floating mode comes at the price of manually added constraints;
while the cycle-fixed mode requires the user to add clock cycles to
[...]

From the standpoint of ease of validation of results, the cycle-fixed
mode is usually a little more convenient than the superstate mode.
This is because the handshaking protocols necessary [...]
to the test bench after superstate mode scheduling [...]
be designed and written in both the test bench and the [...]
modified to match the schedule of I/O of the post-synthesis design.

One area in which the free-floating mode seems to be more
convenient than the others is in that of exploration. Here [...]
or algorithm, than in getting its interfaces exactly [...]. In this
context the ease of turning the design around and [...]
freedom from methodology [...] makes it simpler [...].
Then when the general outlines of the algorithms, [...],
etc. are clear the user can begin to worry about the detailed I/O
timing.

The overall effort of getting I/O interfaces right using these three
modes is usually less than the effort spent in getting the best
possible quality of results. Even with behavioral synthesis, HDL writing
styles still can have a large impact on the quality of the synthesized
circuit. Examples that can affect synthesis quality are: loop
ordering, assignment of variables and arrays to memories, choice of loop
pipeline initiation intervals and latencies, pipelined components,
embedding combinational logic in reusable function blocks, the
tradeoff between multicycle operations and fast clock rates, and the
partitioning of the design into datapath/controller subunits (i.e.
always blocks; in VHDL, processes). All are potentially of great
importance to the quality of results, and all represent true
engineering decisions that must be carefully considered if a really good
design is to be achieved.

7.0 Conclusion 

We have presented HDL methodologies for the synthesis of various 
kinds of I/O timing and protocols, and for simulation-based 
validation of the synthesized design against the original specification. 
Three modes of scheduling I/O operations have been presented: 

1. Cycle-fixed, in which the design has exactly the same cycle- 
level I/O timing before and after synthesis; 

2. Superstate-fixed, in which I/O operations are grouped by pairs 
of @posedge statements; post-synthesis timing behavior is a 
(potentially) stretched version of the pre-synthesis timing; and 

3. Free-floating, in which the only constraints on I/O scheduling 
are either between operations sharing a port or supplied by the 
user. 
Some of the implications of the scheduling modes were described. 
In the cycle-fixed and superstate modes, these involve the 
placement of clock edge statements, loop boundaries, conditionals, and 
I/O operations; while in the free-floating mode there are no rules of 
this kind. 
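
The three modes can be read as three different acceptance tests on a 
post-synthesis I/O schedule. The following Python sketch is an 
illustration only and is not taken from the paper or from any Synopsys 
tool. The function names, the dictionary encoding of a schedule 
(operation name to clock cycle), and the (a, b, min_gap, max_gap) form of 
a user constraint are all assumptions made for this example. The 
superstate check is deliberately simplified to "operations in different 
superstates keep their relative order", and the free-floating check 
assumes that operations sharing a port must keep their original order. 

```python
from itertools import combinations

def cycle_fixed_ok(pre, post):
    """Cycle-fixed: every I/O operation keeps exactly its pre-synthesis cycle."""
    return all(post[op] == cycle for op, cycle in pre.items())

def superstate_ok(superstate, post):
    """Superstate-fixed (simplified assumption): operations belonging to an
    earlier superstate are scheduled strictly before those of a later one,
    so the schedule may stretch but groups never interleave."""
    for a, b in combinations(superstate, 2):
        if superstate[a] < superstate[b] and not post[a] < post[b]:
            return False
        if superstate[a] > superstate[b] and not post[a] > post[b]:
            return False
    return True

def free_floating_ok(port, pre, post, user_constraints=()):
    """Free-floating (simplified assumption): operations sharing a port keep
    their original order; everything else is bound only by user constraints
    of the hypothetical form (a, b, min_gap, max_gap) on post[b] - post[a]."""
    by_port = {}
    for op, p in port.items():
        by_port.setdefault(p, []).append(op)
    for ops in by_port.values():
        ordered = sorted(ops, key=lambda op: pre[op])
        if not all(post[x] < post[y] for x, y in zip(ordered, ordered[1:])):
            return False
    return all(lo <= post[b] - post[a] <= hi
               for a, b, lo, hi in user_constraints)

# Hypothetical three-operation design used only to exercise the checks.
pre = {"read_a": 0, "read_b": 1, "write_y": 2}
superstate = {"read_a": 0, "read_b": 0, "write_y": 1}
port = {"read_a": "in", "read_b": "in", "write_y": "out"}
stretched = {"read_a": 0, "read_b": 2, "write_y": 5}
floated = {"read_a": 1, "read_b": 3, "write_y": 2}

print(cycle_fixed_ok(pre, stretched))        # stretched timing is not cycle-fixed
print(superstate_ok(superstate, stretched))  # but it respects the superstates
print(free_floating_ok(port, pre, floated))  # only the shared-port order is checked
```

In this toy model the stretched schedule fails the cycle-fixed test but 
passes the superstate test, while the floated schedule is acceptable 
only in the free-floating mode unless a user constraint forbids it. 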

Experience with production software which implements these 
methodologies has been described, and conclusions based on that 
experience have been drawn. 




Atty. Docket No. 4000/10 

Inventor: Tai A. Ly et al. 
Title: METHODS FOR AUTOMATICALLY 
PIPELINING LOOPS 



Jonathan T. Kaplan, Esq. 
Registration No. 38,935 
Brown Raysman Millstein Felder & Steiner LLP 
Attorney for Applicants 
120 West Forty-Fifth Street 
New York, New York 10036 
Phone: (212) 944-1515 Fax: (212) 840-2429 



APPENDIX B 
Sheet 7 of 7 



PATENT 

In re Patent No. 5,764,951 to LY et al. 
Application Serial No. 09/590,584 
Attorney Docket 06816.0010 



EXHIBIT 3 
TO 

DECLARATION OF JONATHAN T. KAPLAN 
AND STATEMENT OF FACTS IN SUPPORT OF FILING 
ON BEHALF OF NON-SIGNING INVENTOR 

Pursuant to 37 CFR 1.47 



Declaration of Jonathan T. Kaplan 6 






Kaplan.000829 



Tue Aug 29 08:37:56 2000 




August 29, 2000 

San Jose, California 



Jonathan T. Kaplan, Esq. 
Brown, Raysman, Millstein, 

Felder & Steiner LLP 
120 W Forty-Fifth St 
New York, NY 10036 

Dear Mr. Kaplan: 

This responds to your letter of July 31, 2000, your file reference 
number 4000/10, in which you requested my signature on a reissue 
of US patent number 5764951. 

I must inform you that I do not believe that the extension 
is novel; and further, I do not believe that it is what we invented. 

I must therefore decline to sign the application. 

Sincerely, 




David W. Knapp 
CTO 

Get2Chip.com, Inc. 